Michael Colella, Christopher Williams, Jonathan Williams
MCSA 32003 Marketing Analytics
Assignment 3: Build Brand Delights/Disappointment maps using Social Media Data

1. Choose any product category of your choice. Download Twitter data for all the leading brands in that product category.


Streaming TV Service
@YouTubeTV
@hulu
@Philo
@Sling
@AppleTV
@DisneyTVA
@ItsOnATT
@fuboTV
https://www.techhive.com/article/3211536/best-streaming-tv-service.html

Since the team wanted consistent results, the tweets were retrieved once and stored as CSV files. The commented-out cells below are the original code used to pull the data; the CSV-import cell that follows them was used throughout the week.
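The scrape-once, cache-to-CSV workflow described above can be sketched as a single helper. This is illustrative only: the `scraper` argument is assumed to behave like `twitter_scraper.get_tweets` (yielding dicts with a `'text'` key), and the path handling mirrors the loading convention used later in the notebook.

```python
import os
import pandas as pd

def scrape_or_load(hashtag, path, scraper=None, pages=100):
    """Return cached tweet texts if the CSV exists; otherwise scrape and cache.

    `scraper` is assumed to behave like twitter_scraper.get_tweets,
    yielding dicts with a 'text' key (an illustrative assumption).
    """
    if os.path.exists(path):
        # Same header/skiprows convention as the import cell below
        return pd.read_csv(path, header=None, skiprows=1)[1].values.tolist()
    tweets = [t['text'] for t in scraper('#' + hashtag, pages=pages)]
    pd.DataFrame(tweets).to_csv(path)
    return tweets
```

With this helper, re-running the notebook pulls fresh data only when no cached CSV exists, which is the consistency property the team wanted.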

In [1]:
import pandas as pd
from twitter_scraper import get_tweets
In [2]:
# %%time
# #YouTubeTV Hashtag Scrape

# YouTubeTV = [tweet['text'] for tweet in get_tweets('#YouTubeTV', pages=100)]
In [3]:
# YouTubeTVdf = pd.DataFrame(YouTubeTV)
# YouTubeTVdf.to_csv('C:/Users/1906g/Desktop/randomfolder/youtube.csv')
In [4]:
# %%time
# #Hulu Hashtag Scrape

# Hulu = [tweet['text'] for tweet in get_tweets('#hulu', pages=100)]
In [5]:
# Huludf = pd.DataFrame(Hulu)
# Huludf.to_csv('C:/Users/1906g/Desktop/randomfolder/Hulu.csv')
In [6]:
# %%time
# #Philo Hashtag Scrape

# Philo = [tweet['text'] for tweet in get_tweets('#Philo', pages=100)]
In [7]:
# Philodf = pd.DataFrame(Philo)
# Philodf.to_csv('C:/Users/1906g/Desktop/randomfolder/Philo.csv')
In [8]:
# %%time
# #Sling Hashtag Scrape

# Sling = [tweet['text'] for tweet in get_tweets('#Sling', pages=100)]
In [9]:
# Slingdf = pd.DataFrame(Sling)
# Slingdf.to_csv('C:/Users/1906g/Desktop/randomfolder/Sling.csv')
In [10]:
# %%time
# #AppleTV Hashtag Scrape

# AppleTV = [tweet['text'] for tweet in get_tweets('#AppleTV', pages=100)]
In [11]:
# AppleTVdf = pd.DataFrame(AppleTV)
# AppleTVdf.to_csv('C:/Users/1906g/Desktop/randomfolder/AppleTV.csv')
In [12]:
# %%time
# #DisneyTV Hashtag Scrape

# DisneyTV = [tweet['text'] for tweet in get_tweets('#DisneyTV', pages=100)]
In [13]:
# DisneyTVdf = pd.DataFrame(DisneyTV)
# DisneyTVdf.to_csv('C:/Users/1906g/Desktop/randomfolder/DisneyTV.csv')
In [14]:
# %%time
# #ItsOnATT Hashtag Scrape

# ItsOnATT = [tweet['text'] for tweet in get_tweets('#ItsOnATT', pages=100)]
In [15]:
# ItsOnATTdf = pd.DataFrame(ItsOnATT)
# ItsOnATTdf.to_csv('C:/Users/1906g/Desktop/randomfolder/ItsOnATT.csv')
In [16]:
# %%time
# #fuboTV Hashtag Scrape

# fuboTV = [tweet['text'] for tweet in get_tweets('#fuboTV', pages=100)]
In [17]:
# fuboTVdf = pd.DataFrame(fuboTV)
# fuboTVdf.to_csv('C:/Users/1906g/Desktop/randomfolder/fuboTV.csv')
In [18]:
# import CSV of data pulled once to maintain consistency
YouTubeTV = pd.read_csv('youtube.csv', header=None, skiprows = 1)[1].values.tolist()
Hulu = pd.read_csv('Hulu.csv', header=None, skiprows = 1)[1].values.tolist()
Philo = pd.read_csv('Philo.csv', header=None, skiprows = 1)[1].values.tolist()
Sling = pd.read_csv('Sling.csv', header=None, skiprows = 1)[1].values.tolist()
AppleTV = pd.read_csv('AppleTV.csv', header=None, skiprows = 1)[1].values.tolist()
DisneyTV = pd.read_csv('DisneyTV.csv', header=None, skiprows = 1)[1].values.tolist()
ItsOnATT = pd.read_csv('ItsOnATT.csv', header=None, skiprows = 1)[1].values.tolist()
fuboTV = pd.read_csv('fuboTV.csv', header=None, skiprows = 1)[1].values.tolist()
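Since the eight `read_csv` calls above differ only in filename, the same loading step can be expressed as one function; the brand names and folder here are assumptions mirroring the cell above.

```python
import pandas as pd

def load_brand_tweets(brands, folder='.'):
    """Load one cached tweet list per brand from '<folder>/<brand>.csv',
    using the same header/skiprows convention as the cell above."""
    return {b: pd.read_csv(f'{folder}/{b}.csv', header=None,
                           skiprows=1)[1].values.tolist()
            for b in brands}
```

For example, `tweets = load_brand_tweets(['youtube', 'Hulu', 'Philo'])` would return a dict of tweet lists keyed by brand.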

2. Generate Document-Term-Frequency matrices (either TDF, DTF, TFIDF) after pre-processing data for stop word removal, special character removal, number removal, and case conversion.


In [19]:
import nltk
nltk.download('stopwords')
# import spacy
import unicodedata
import re
from nltk.corpus import wordnet
import collections
from nltk.tokenize.toktok import ToktokTokenizer
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jonat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [20]:
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
In [21]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    if bool(soup.find()):
        [s.extract() for s in soup(['iframe', 'script'])]
        stripped_text = soup.get_text()
        stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', stripped_text)
    else:
        stripped_text = text
    return stripped_text

#def lemmatize_text(text):
#    text = nlp(text)
#    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
#    return text

def simple_porter_stemming(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]|\[|\]' if not remove_digits else r'[^a-zA-Z\s]|\[|\]'
    text = re.sub(pattern, '', text)
    return text


def remove_stopwords(text,stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

def normalize_corpus(corpus, html_stripping=True, 
                     accented_char_removal=True, text_lower_case=True, 
                     text_stemming=False, text_lemmatization=False, 
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:

        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)

        # remove extra newlines
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))

        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)

        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)

        # stem text
        if text_stemming and not text_lemmatization:
            doc = simple_porter_stemming(doc)

        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)

         # lowercase the text    
        if text_lower_case:
            doc = doc.lower()

        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc,stopwords=stopwords)

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        doc = doc.strip()
            
        normalized_corpus.append(doc)
        
    return normalized_corpus
In [22]:
#YouTube Corpus
corpus_YouTubeTV = normalize_corpus(YouTubeTV)
In [23]:
#Hulu Corpus
corpus_Hulu = normalize_corpus(Hulu)
In [24]:
#Philo Corpus
corpus_Philo = normalize_corpus(Philo)
In [25]:
#Sling Corpus
corpus_Sling = normalize_corpus(Sling)
In [26]:
#AppleTV Corpus
corpus_AppleTV = normalize_corpus(AppleTV)
In [27]:
#DisneyTV Corpus
corpus_DisneyTV = normalize_corpus(DisneyTV)
In [28]:
#ItsOnATT Corpus
corpus_ItsOnATT = normalize_corpus(ItsOnATT)
In [29]:
#fuboTV Corpus
corpus_fuboTV = normalize_corpus(fuboTV)

TF-IDF for leading brands in Streaming TV Service

YouTubeTV - TFIDF

In [30]:
#YouTube TF-IDF
YouTubeTV_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
YouTubeTV_matrix = YouTubeTV_vec.fit_transform(corpus_YouTubeTV)
YouTubeTV_matrix = YouTubeTV_matrix.toarray()

vocab = YouTubeTV_vec.get_feature_names()
YouTubeTV_TFIDF = pd.DataFrame(np.round(YouTubeTV_matrix, 2), columns=vocab)
YouTubeTV_TFIDF.head(5)
#https://www.youtube.com/watch?v=WN18JksF9Cg
#Count Vectorizer Vs TF-IDF for Text Processing
Out[30]:
actually added adidas allow alternate alticeusa amazinggracegeriminelli app appletv asked ... work workshop xbox xfinity yall yesterdays youtube youtubechannel youtubetips youtubetv
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.09
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 1.00
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.13 0.00 0.00 0.14
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.22 0.25 0.25 0.06
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.14

5 rows × 256 columns

Hulu - TFIDF

In [31]:
#Hulu TF-IDF

Hulu_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
Hulu_matrix = Hulu_vec.fit_transform(corpus_Hulu)
Hulu_matrix = Hulu_matrix.toarray()

vocab = Hulu_vec.get_feature_names()
Hulu_TFIDF = pd.DataFrame(np.round(Hulu_matrix, 2), columns=vocab)
Hulu_TFIDF.head(5)
Out[31]:
abematv account action ads adultswim adventure alexa alexandros amazon amazonprime ... webseries website willandgrace world writer writers xbox xboxone youre youtube
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.35 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 193 columns

Philo - TFIDF

In [32]:
#Philo TF-IDF

Philo_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
Philo_matrix = Philo_vec.fit_transform(corpus_Philo)
Philo_matrix = Philo_matrix.toarray()

vocab = Philo_vec.get_feature_names()
Philo_TFIDF = pd.DataFrame(np.round(Philo_matrix, 2), columns=vocab)
Philo_TFIDF.head(5)
Out[32]:
acoustic age aidoneus album amazon amour anticipation antiquite arendt artiste ... version via vie vivre vrai yahoo yeux youtube youtubespacepar zeus
0 0.00 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.00 0.14 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0
1 0.00 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0
2 0.00 0.27 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.00 0.23 0.0 0.0 0.0 0.0 0.0 0.27 0.0 0.0
3 0.22 0.00 0.0 0.22 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.22 0.00 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0
4 0.00 0.00 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0

5 rows × 283 columns

Sling - TFIDF

In [33]:
#Sling TF-IDF

Sling_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
Sling_matrix = Sling_vec.fit_transform(corpus_Sling)
Sling_matrix = Sling_matrix.toarray()

vocab = Sling_vec.get_feature_names()
Sling_TFIDF = pd.DataFrame(np.round(Sling_matrix, 2), columns=vocab)
Sling_TFIDF.head(5)
Out[33]:
accidents arm arts baby babyboy babycarrier babygirl babysling babywearing bag ... steps sure topbeauty totxeya treatment tv twitter use waistbaby walkabout
0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.00 0.0
1 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.00 0.0
2 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.00 0.0
3 0.0 0.0 0.0 0.55 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.14 0.0 0.22 0.0
4 0.2 0.2 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.2 0.0 0.00 0.0 0.00 0.0

5 rows × 102 columns

AppleTV - TFIDF

In [34]:
#AppleTV TF-IDF

AppleTV_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
AppleTV_matrix = AppleTV_vec.fit_transform(corpus_AppleTV)
AppleTV_matrix = AppleTV_matrix.toarray()

vocab = AppleTV_vec.get_feature_names()
AppleTV_TFIDF = pd.DataFrame(np.round(AppleTV_matrix, 2), columns=vocab)
AppleTV_TFIDF.head(5)
Out[34]:
actu aggressively alexa alone already amazing amazon amazonappletv amazonfiretv amazons ... wanted watch watching week wer whoa worth worthy writing yet
0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.21 0.0 ... 0.00 0.00 0.0 0.21 0.0 0.0 0.00 0.0 0.0 0.0
1 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 ... 0.00 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0
2 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 ... 0.00 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0
3 0.0 0.00 0.0 0.0 0.35 0.0 0.0 0.0 0.00 0.0 ... 0.35 0.23 0.0 0.00 0.0 0.0 0.35 0.0 0.0 0.0
4 0.0 0.28 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 ... 0.00 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0

5 rows × 260 columns

DisneyTV - TFIDF

In [35]:
#DisneyTV TF-IDF

DisneyTV_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
DisneyTV_matrix = DisneyTV_vec.fit_transform(corpus_DisneyTV)
DisneyTV_matrix = DisneyTV_matrix.toarray()

vocab = DisneyTV_vec.get_feature_names()
DisneyTV_TFIDF = pd.DataFrame(np.round(DisneyTV_matrix, 2), columns=vocab)
DisneyTV_TFIDF.head(5)
Out[35]:
abc ada adobe afterlife ahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh album andrews animalkingdom animationguild anyone ... whilst womeninanimation write yall yang yay yeniahval youtube yuk zoom
0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.00 0.0 0.00 ... 0.00 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.23 0.0 0.23 ... 0.23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.00 ... 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.00 ... 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.00 ... 0.00 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 293 columns

ItsOnATT - TFIDF

In [36]:
#ItsOnATT TF-IDF

ItsOnATT_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
ItsOnATT_matrix = ItsOnATT_vec.fit_transform(corpus_ItsOnATT)
ItsOnATT_matrix = ItsOnATT_matrix.toarray()

vocab = ItsOnATT_vec.get_feature_names()
ItsOnATT_TFIDF = pd.DataFrame(np.round(ItsOnATT_matrix, 2), columns=vocab)
ItsOnATT_TFIDF.head(5)
Out[36]:
abqecondev abqfilmoffice abqtech action asia att attemployee attonlocation bakugan bekind ... west wihff wild witch women wondering words working yall years
0 0.00 0.0 0.0 0.0 0.0 0.00 0.21 0.0 0.00 0.0 ... 0.21 0.0 0.0 0.21 0.0 0.0 0.21 0.0 0.0 0.0
1 0.00 0.0 0.0 0.0 0.0 0.38 0.00 0.0 0.00 0.0 ... 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0
2 0.17 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.22 0.0 ... 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0
3 0.18 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 ... 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0
4 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.0 ... 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0

5 rows × 169 columns

fuboTV - TFIDF

In [37]:
#fuboTV TF-IDF

fuboTV_vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                     use_idf=True, smooth_idf=True)
fuboTV_matrix = fuboTV_vec.fit_transform(corpus_fuboTV)
fuboTV_matrix = fuboTV_matrix.toarray()

vocab = fuboTV_vec.get_feature_names()
fuboTV_TFIDF = pd.DataFrame(np.round(fuboTV_matrix, 2), columns=vocab)
fuboTV_TFIDF.head(5)
Out[37]:
able affiliates agreement amazing announced baked best bestfriends bring broadcast ... use via viacomcbss watch watching watchnow weeks would year youtubetv
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.00 0.0 0.0 0.2 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.00 0.0 0.0 0.2 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.00 0.0 0.0 0.2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.36 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.30 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 147 columns
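The eight TF-IDF cells above repeat the same steps, so the pattern can be factored into one helper. This is a sketch with the same parameters as the cells above; the `try`/`except` is there because `get_feature_names()` was renamed `get_feature_names_out()` in newer scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_table(corpus):
    """Fit a TF-IDF vectorizer on a corpus and return a rounded DataFrame,
    matching the per-brand cells above."""
    vec = TfidfVectorizer(min_df=0., max_df=1., norm='l2',
                          use_idf=True, smooth_idf=True)
    matrix = vec.fit_transform(corpus).toarray()
    try:
        vocab = vec.get_feature_names_out()  # scikit-learn >= 1.0
    except AttributeError:
        vocab = vec.get_feature_names()      # older scikit-learn
    return pd.DataFrame(np.round(matrix, 2), columns=vocab)
```

For example, `YouTubeTV_TFIDF = tfidf_table(corpus_YouTubeTV)` reproduces the first table above.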

3. Generate Wordclouds for each of the Brands separately. Do they reveal anything? Summarize insights.


In [38]:
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt
In [39]:
def CreateWordCloud(STOPWORDS, corpus, title='title'):
    raw_string = ' '.join(corpus)  # space-join so words at tweet boundaries don't merge
    no_links = re.sub(r'http\S+', '', raw_string)
    no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
    no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
    
    words = no_special_characters.split(" ")
    words = [w for w in words if len(w) > 2]  # ignore a, an, be, ...
    words = [w.lower() for w in words]
    words = [w for w in words if w not in STOPWORDS]
    
    wc = WordCloud(width = 600, height = 400, background_color="white", max_words=2000)
    clean_string = ','.join(words)
    wc.generate(clean_string)
    
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wc)
    plt.title(title, size=50)
    plt.axis("off") 
    plt.tight_layout(pad = 0)
    plt.show()

Please note that the interpretation of these word-cloud groupings is necessarily subjective.

YouTube Word Cloud

The YouTubeTV word cloud reveals a strong association with HBO Max.

Further research turned up news headlines reporting that HBO and HBO Max were headed for YouTubeTV. This clearly created a lot of buzz on Twitter, as the word cloud reveals.

In [40]:
#YouTube Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['youtubetv', 'youtube', 'https', 'pictwittercom', 'pic']
CreateWordCloud(STOPWORDS,corpus_YouTubeTV,'YouTubeTV Cloud')

Hulu Word Cloud

Hulu’s word cloud points to a season two of some show, along with popular titles such as Sonic, The Good Doctor, and Breaking Bad; Sonic appears multiple times.

Further research shows that Hulu at some point landed the streaming rights to The Good Doctor, a deal that evidently paid off, since the show continues to appear in the Twitter conversation. Hulu also carries more than one Sonic title, including Sonic Boom and Sonic the Hedgehog, which may explain why Sonic appears several times in the word cloud.

In [41]:
#Hulu Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['hulu', 'https', 'pictwittercom', 'pic']
CreateWordCloud(STOPWORDS,corpus_Hulu,'Hulu Cloud')

Philo Word Cloud

Much of Philo’s word cloud appears to be non-English words. Interestingly, some of the main English words are language, communication, and age.

Further research shows that Philo offers channels similar to the other streaming services, but focuses on lifestyle channels. The non-English words likely reflect when the tweets were pulled: a popular lifestyle show was probably generating buzz in a language other than English at the time.

In [42]:
#Philo Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['https', 'philo', 'philosophie', 'philosophie', 'learnet', 'labelles', 'est', 'tout', 'du', 'pic']
CreateWordCloud(STOPWORDS,corpus_Philo,'Philo Cloud')

Sling Word Cloud

At the time the tweets were pulled for Sling TV, there was clearly buzz around something baby related. Most of the keywords were cute child, carrierbaby, waistbaby, safe baby, and simply baby.

Further research shows that the #Sling hashtag is not exclusive to Sling TV; it is apparently dominated by a baby-sling product. At the time the tweets were pulled, much of the Twitter community was using #Sling to talk about babies and baby slings.
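One lightweight way to reduce this kind of hashtag collision is to keep only tweets mentioning service-related terms. The keyword list below is an illustrative assumption, not part of the original analysis, and would need tuning per brand.

```python
# Illustrative keyword list; tune per brand (an assumption, not from the analysis)
TV_TERMS = {'tv', 'stream', 'streaming', 'channel', 'channels',
            'watch', 'dvr', 'espn', 'slingtv'}

def looks_tv_related(tweet, terms=TV_TERMS):
    """Heuristic: True if any service-related term appears as a substring
    of a word in the tweet."""
    words = tweet.lower().split()
    return any(term in word for word in words for term in terms)
```

For example, `[t for t in Sling if looks_tv_related(t)]` would drop most of the baby-sling tweets before building the word cloud.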

In [43]:
#Sling Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['sling', 'pictwittercom', 'https', 'http', 'pic']
CreateWordCloud(STOPWORDS,corpus_Sling,'Sling Cloud')

AppleTV Word Cloud

For AppleTV, there were many non-English keywords, and other brands appeared alongside AppleTV tweets: a few of those keywords were Amazon, Alexa, and Samsung, in addition to Smart TV. Compatible also appeared, which may reflect people comparing AppleTV with compatible products.

Apple, being the worldwide company that it is, even has a translation app; this global reach may explain the range of languages in the associated tweets. There may also simply have been a burst of buzz in other languages during the period the tweets were pulled.

In [44]:
#AppleTV Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['appletv', 'pictwittercom', 'http', 'pic']
CreateWordCloud(STOPWORDS,corpus_AppleTV,'AppleTV Cloud')

DisneyTV Word Cloud

DisneyTV’s word cloud carries a mostly positive connotation, with words like great dominating.

Given that Disney has many kid- and teen-oriented programs, the word cloud fits the brand.

In [45]:
#DisneyTV Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['disneytv', 'disney', 'https', 'pictwittercom', 'bitly', 'pic']
CreateWordCloud(STOPWORDS,corpus_DisneyTV,'DisneyTV Cloud')

ItsOnATT Word Cloud

Where the other word clouds had a few dominant keywords, ItsOnATT shows a more balanced distribution of associated words.

Notable keywords include directv, nbcuniversal, filmoffice, netflix, richarbranson, and elonmusk, revealing big-name associations with AT&T.
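The impression that ItsOnATT's word frequencies are more evenly spread than the other brands' can be checked numerically, for example via the share of tokens taken by the single most common word. This helper is a sketch, not part of the original notebook.

```python
from collections import Counter

def top_word_share(corpus, stopwords=()):
    """Fraction of all (non-stopword) tokens taken by the most common word.
    Lower values suggest a flatter, more balanced distribution."""
    counts = Counter(w.lower() for doc in corpus for w in doc.split()
                     if w.lower() not in stopwords)
    total = sum(counts.values())
    return counts.most_common(1)[0][1] / total if total else 0.0
```

Comparing, say, `top_word_share(corpus_ItsOnATT, STOPWORDS)` against `top_word_share(corpus_YouTubeTV, STOPWORDS)` would quantify the balance claim above.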

In [46]:
#ItsOnATT Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['itsonatt', 'https', 'twittercom']
CreateWordCloud(STOPWORDS,corpus_ItsOnATT,'ItsOnATT Cloud')

fuboTV Word Cloud

The word cloud for fuboTV is interesting: the dominant keywords are chocolate-chip-cookie related, suggesting a very popular cooking show. For example, chewy, cookies, gooey, baked, chocolate, and chip are scattered throughout the cloud.

The #fuboTV hashtag itself is very sparse: a little scrolling quickly reaches 2019 tweets, and a little more reaches 2018. It was difficult to scroll back far enough to find the cookie-related tweets manually, but the Twitter scraper evidently reached far enough back to surface this theme.

In [47]:
#fuboTV Word Cloud
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = STOPWORDS + ['fubotv', 'https', 'pictwittercom', 'pic']
CreateWordCloud(STOPWORDS,corpus_fuboTV,'fuboTV Cloud')

4. Perform Topic modeling via Latent Dirichlet Allocation (LDA) on the DTF matrix. Extract top 10 topics. Interpret the topics. Are they Brand specific?


In [48]:
#LDA conceptual breakdown.
#https://www.youtube.com/watch?v=DWJYZq_fQ2A

# LDA SCRIPT
# https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re

Each set of topics is brand specific within its own 10-topic output. The brand terms themselves were added as stop words, so they do not appear in the 10-most-common-words visualizations or the LDA topic outputs. We have interpreted the first five topics in each set and, where relevant, noted brand-specific terms related to the streaming service. Because LDA is randomly initialized, topic numbering can shift between runs; the interpretations refer to the run used when the notebook was written.

Please note that the interpretation of these topics is necessarily subjective.

LDA - YouTubeTV

In [49]:
YouTubeTV_LDA = YouTubeTV
YouTubeTV_LDA = pd.DataFrame(YouTubeTV_LDA)

# Remove punctuation
YouTubeTV_LDA[0] = YouTubeTV_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
YouTubeTV_LDA[0] = YouTubeTV_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
YouTubeTV_LDA.head()
Out[49]:
0
0 on wednesday #hbomax will be available to #you...
1 we have #youtubetv
2 @youtubetv review: channel lineup dvr local ch...
3 how to make a youtube channel\n\neasy steps an...
4 @youtubetv @hbomax my internet package include...
In [50]:
# Join the different processed titles together.
long_string = ','.join(list(YouTubeTV_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[50]:
In [51]:
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    words = count_vectorizer.get_feature_names()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts+=t.toarray()[0]
    
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 
    
    plt.figure(2, figsize=(15, 15/1.6180))
    plt.subplot(title='10 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    sns.barplot(x_pos, counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()
    
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['youtubetv', 'youtube', 'https', 'pictwittercom'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(YouTubeTV_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [52]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
 
# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
autoedition hbomax video google guide purchased includes internet directly link

Topic #1:
deals close comcast 20 done 2020 get said completed suspended

Topic #2:
thanks woah restaurantstartup time midweeker unv1gr9imma track latest sing faithful

Topic #3:
tv even better hulu sap known individual primary switch carry

Topic #4:
channel make youtubetips vlog youtubechannel wwwprofiletreecom http hslrwzys5g steps easy

Topic #5:
streaming channels 349rn0h review dvr htpc buffly cordcutter lineup cordcutters

Topic #6:
missed last via lastdance espn sports netflix nba michaeljordan michael

Topic #7:
ft losangeles unitedmasters e9txxzzwibm6 beverlyhills 22 lafc longbeach adidas godsplan

Topic #8:
subscribers hbomax max 1499 month hbo per available wednesday restaurantstartup

Topic #9:
tcm hulu new may directv made verizon 27 sony properties

YouTubeTV Topics Interpretation:

  • Topic #0: This topic indicates binge-watching Shark Tank.
  • Topic #1: This topic indicates missing COVID-19 updates.
  • Topic #2: This topic revolves around the Michael Jordan Last Dance documentary.
  • Topic #3: This topic indicates some measure of TCM subscriber volume (TCM = Turner Classic Movies).
  • Topic #4: This topic indicates some type of tutorial, with words such as make, easy, and steps.

This batch of topics is brand specific, centering on YouTubeTV along with the TCM, Michael Jordan, and Shark Tank brands.

LDA - Hulu

In [53]:
Hulu_LDA = Hulu
Hulu_LDA = pd.DataFrame(Hulu_LDA)
# Remove punctuation
Hulu_LDA[0] = Hulu_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
Hulu_LDA[0] = Hulu_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
Hulu_LDA.head()
Out[53]:
0
0 #hulu #rt\nhttp://3stepme/37t9
1 @gooddoctorabc on #hulu is a great series
2 folks: let's never grow up checking out soni...
3 ソウルグッドマン、ジェシー、マイク\n\n #あなたが推しすぎな3人\n#breakingb...
4 was watching #thegreat on #hulu and was ready ...
In [54]:
# Join the different processed titles together.
long_string = ','.join(list(Hulu_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[54]:
In [55]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['hulu', 'https', 'pictwittercom',
                                                          'e3', '81', 'aa', '89', '82'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(Hulu_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [56]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
great series best original programming think gooddoctorabc watching http 37t9

Topic #1:
huluaccount account sellhulu interested dm selling cashapp watching http rt

Topic #2:
next amazon netflix http primevideo bd bc サブスクライブ サブスク アマプラ

Topic #3:
amazonprime netflix 2010 marvel documentary download drama entertainment forever horror

Topic #4:
watching 日テレ idol1曲 home live メドレーver でlysdvdの うちでアガれ 地上波で嬉しい 流してくれるとは

Topic #5:
alexandros amznto tv fire までお気に入りのコンテンツを大画面で amazonprimevideo スポーツ abematv ニュース 35f2hyd

Topic #6:
nba2k20park complete bio nba2kleague nba2kcommunity cahoimfjr4a nba2k20 nba2k thumbnail avatarthelastairbender

Topic #7:
netflix ジェシー breakingbad ynoijv3pzw あなたが推しすぎな3人 amazonプライムビデオ ソウルグッドマン マイク watching http

Topic #8:
two season watching us reality norris binge birthday like nnnnnnnnnnoooooooooo

Topic #9:
野口衣織 湘南乃風 ナイスエラー 郡司恭子 ノイミー ノットイコールミー プロスタ bs日テレ 大谷映美里 新羅慎二

Hulu Topics Interpretation:

  • Topic #1: This topic revolves around The Goldbergs and the Nintendo Switch.
  • Topic #2: This topic is about watching season two of Us.
  • Topic #3: This topic relates to NBA 2K live TV with Hulu.
  • Topic #4: This topic revolves around Breaking Bad on Netflix.
  • Topic #5: This topic is also related to Netflix.

This batch of topics is brand specific, with a notable association between Hulu and Netflix.

LDA - Philo

In [57]:
Philo_LDA = Philo
Philo_LDA = pd.DataFrame(Philo_LDA)
# Remove punctuation
Philo_LDA[0] = Philo_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
Philo_LDA[0] = Philo_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
Philo_LDA.head()
Out[57]:
0
0 pourquoi serait-il au fond désirable d’être so...
1 observer le monde; c'est le comprendre #philos...
2 le langage et la communication dans la société...
3 click on the link and checkout the acoustic ve...
4 maintenant en direct épisode 2 de mon lab'orat...
In [58]:
# Join the different processed titles together.
long_string = ','.join(list(Philo_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[58]:
In [59]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['https', 'philo', 'philosophie', 'learnet', 'labelles', 'est', 'tout', 'du',
                                                          'de', 'le', 'la', 'et', 'les', 'en', 'vie', 'une', 'ce', 'ne'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(Philo_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [60]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
artiste nouvelle se moment michel debord réalise introduction art devenir

Topic #1:
youtube sur direct psychologie ateliers mon formation dream épisode nous

Topic #2:
choses toutes quatre racines connais citation qui ses larmes nhcbndu2zp

Topic #3:
monde youtube citation ea8naaor2j4 sagesse éditeur chaîne authenticité editer youtubespacepar

Topic #4:
bonne qu biggest mall two months butler autres judith loan

Topic #5:
status twittercom 1263580919740264448 platon congolesewife verite mène essentiel oasvc0lz2g communication

Topic #6:
plus ont chose descartes pense bien que twittercom status chacun

Topic #7:
youtube dans communication http amour vía société langage nouvel age

Topic #8:
sisters album link hit ground grandslamacoustic grandslam lockdownuknow fanlinkto sistersofmercy

Topic #9:
serait être pourquoi il philochemins 1263843557581959168 au partout désirable aise

Philo Topics Interpretation:

  • Philo’s topics are dominated by French-language philosophy tweets; the #Philo hashtag is shared with the French philosophy community rather than the Philo TV service.
  • However, Topic #7 is about language and communication in society.
  • Topic #8 is related to the musical group Grand Slam and their song Sisters of Mercy.
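Since #Philo collides with the French philosophy community, a crude pre-filter could screen out obviously French tweets before topic modeling. The hint list and zero-tolerance threshold below are our own illustrative assumptions; a dedicated language-detection library would be more robust:

```python
# Common French function words; a tweet containing any of them is dropped.
FRENCH_HINTS = {'le', 'la', 'les', 'est', 'et', 'une', 'dans', 'pour', 'que'}

def looks_english(text):
    """Return True if the tweet contains none of the French hint words."""
    return not (set(text.lower().split()) & FRENCH_HINTS)

tweets = ["observer le monde cest le comprendre",
          "click on the link and checkout the acoustic version"]
english_only = [t for t in tweets if looks_english(t)]
# english_only keeps only the second tweet
```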

LDA - Sling

In [61]:
Sling_LDA = Sling
Sling_LDA = pd.DataFrame(Sling_LDA)
# Remove punctuation
Sling_LDA[0] = Sling_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
Sling_LDA[0] = Sling_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
Sling_LDA.head()
Out[61]:
0
0 眠ってしまった赤ちゃんの様子を見ながら、リングありスリングならばそうっとポーチをゆるめていき...
1 寄り添い抱きはおしりが沈みすぎると、ひざ裏に赤い痕が付く場合がありますので、注意深く観察しま...
2 寄り添い抱きをする場合、リングありスリングならスリングを掛けた側の手でテールを引き、適度に密...
3 high quality & eco-friendly baby carrier/baby ...
4 jason hated having a big sister #fractures #sl...
In [62]:
# Join the different processed titles together.
long_string = ','.join(list(Sling_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[62]:
In [63]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['sling', 'pictwittercom', 'https', 'http'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(Sling_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [64]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
カンガルー抱きをやり直してみてください 赤ちゃんの体が傾いてしまう場合は うまくいかない場合はもう一度 オシリの位置が肩パッド側に片寄っています ポーチの中で位置をずらすか カンガルー抱きをした時に カンガルー抱きの時の赤ちゃんの足の位置は オシリよりも低い位置にならないよう 充分注意しましょう 見た目の美しさではなく安全性と実用性ということを念頭に置くとよいでしょう

Topic #1:
baby carrier safe waist friendly cute child eco quality hipseat

Topic #2:
見た目の美しさではなく安全性と実用性ということを念頭に置くとよいでしょう スリングを選ぶ時に大切なのことは 赤ちゃんの顔が外を向く抱っこの方法です カンガルー抱きは なんてもったいない スリングをせっかく手に入れたものの 使い方が分からずタンスの肥やし 眠ってしまった赤ちゃんの様子を見ながら リングありスリングならばそうっとポーチをゆるめていきます とにかく不器用で扱い方に自信がない方は

Topic #3:
poorjason hated sprains sitstill sister secondhandbookshop pain misery medical jason

Topic #4:
リングありスリングならスリングを掛けた側の手でテールを引き 注意深く観察しましょう 適度に密着するまで調節をします ひざ裏に赤い痕が付く場合がありますので 寄り添い抱きはおしりが沈みすぎると 寄り添い抱きをする場合 見た目の美しさではなく安全性と実用性ということを念頭に置くとよいでしょう スリングを選ぶ時に大切なのことは 赤ちゃんの顔が外を向く抱っこの方法です カンガルー抱きは

Topic #5:
ことにあります 赤ちゃんとコミュニケーションをとる スリング上達の一番のコツは スリングを上手に使いこなすためのコツは 吐き戻ししやすい赤ちゃんの場合 カンガルー抱きは空腹時がオススメです ママがスリングを装着するところから既に始まっています 見た目の美しさではなく安全性と実用性ということを念頭に置くとよいでしょう スリングを選ぶ時に大切なのことは 赤ちゃんの顔が外を向く抱っこの方法です

Topic #6:
4歳ぐらいのお子さんまで抱っこすることが スリング ベビースリングとは 新生児から大体3 私たちユーザーの間では 簡単に と呼んでいます できる布状の抱っこ紐です レールに綿の入ったベルトテールタイプが扱いやすいでしょう とにかく不器用で扱い方に自信がない方は

Topic #7:
get bitly w6bc84u7uz local receive rcaantenna 2zoobu8 free offer sign

Topic #8:
follows bag steps fashiontravelslingbag crafts top arts beauty bck2ziaade bagwholesale

Topic #9:
v6r6qa9z3x make babyboy babycarrier babygirl know babysling babywearing ticks sure

Sling Topics Interpretation:

As observed in the word cloud exercise, Sling TV shares the hashtag #sling with a baby sling product, and baby-related tweets dominated the query. This illustrates a limitation of hashtag-based research.

As a result, only Topic #7 is related to Sling TV, and it revolves around a free sign-up offer (a free RCA antenna for receiving local channels).
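One mitigation, sketched under our own assumptions (the keyword list is illustrative), is to keep only those #sling tweets that also mention TV-related terms:

```python
# Keep only tweets that share vocabulary with the TV service, not the baby sling.
TV_TERMS = {'tv', 'channel', 'channels', 'stream', 'streaming', 'dvr', 'antenna'}

def mentions_tv(text):
    """Return True if the tweet contains at least one TV-related term."""
    return bool(TV_TERMS & set(text.lower().split()))

tweets = ["high quality ecofriendly baby carrier baby sling",
          "sign up for sling and get a free rca antenna to receive local tv"]
tv_tweets = [t for t in tweets if mentions_tv(t)]
# tv_tweets keeps only the Sling TV promotion
```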

LDA - AppleTV

In [65]:
AppleTV_LDA = AppleTV
AppleTV_LDA = pd.DataFrame(AppleTV_LDA)
# Remove punctuation
AppleTV_LDA[0] = AppleTV_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
AppleTV_LDA[0] = AppleTV_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
AppleTV_LDA.head()
Out[65]:
0
0 missed an episode of #techstrongtv this week d...
1 regarder le mini serie défendre jacob exclusiv...
2 news shotgun 5/23 https://wpme/p7sesl-ktc  #ne...
3 i bought #appletv because i wanted to watch @c...
4 apple is buying content aggressively to challe...
In [66]:
# Join the different processed titles together.
long_string = ','.join(list(AppleTV_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[66]:
In [67]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['appletv', 'pictwittercom', 'http', 'apple', 'de', 'en'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(AppleTV_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [68]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
streaming series jacob serie dbserie watch homebeforedark défendre mini exclusivement

Topic #1:
4k con compatible smarttv lo televisor procesador hdr alexa samsung

Topic #2:
news hero chrisevans defendingjacob 23 ps volumes andy joanna people

Topic #3:
tv bundesliga prime schalke und die angebot streaming amazon smart

Topic #4:
netflix para content episodes 05 mais tom uma salas vitima

Topic #5:
home amazon オカモト q9xzjx7vav stayhome stay よく一緒に購入している商品がこちら なんで の充実を図るべくamazonでappletvを検索したのだが watch

Topic #6:
great watch one show mythicquest twittercom lil mystery status favor

Topic #7:
defendingjacob episode next wait think finale best ooooooh 1pqy0gspbj friday

Topic #8:
app sont news bitly previous tstv 3brtmeh ku6gcmrye6 missed page

Topic #9:
greyhound le sur tv sortira film finalement uss tomhanks i6pzxz7k75

AppleTV Topics Interpretation:

  • Topic #0: This topic revolves around streaming the mini-series Defending Jacob and Home Before Dark.
  • Topic #1: Spanish-language tweets about 4K/HDR compatibility with Samsung smart TVs.
  • Topic #2: This topic is about Chris Evans and Defending Jacob.
  • Topic #3: German-language tweets about streaming the Bundesliga (Schalke) via Amazon Prime.
  • Topic #9: This topic is about the Tom Hanks film Greyhound coming to AppleTV.

This batch of topics is brand specific, centering on AppleTV with associations to the Amazon, Samsung, and Netflix brands.

LDA - DisneyTV

In [69]:
DisneyTV_LDA = DisneyTV
DisneyTV_LDA = pd.DataFrame(DisneyTV_LDA)
# Remove punctuation
DisneyTV_LDA[0] = DisneyTV_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
DisneyTV_LDA[0] = DisneyTV_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
DisneyTV_LDA.head()
Out[69]:
0
0 stop the press \n\ncan we all give a huge con...
1 anyone else watching disney tv whilst home lo...
2 i could no longer restrain my tears\nbut in th...
3 still need to get on watching through all of g...
4 the walt disney company: please give fans a st...
In [70]:
# Join the different processed titles together.
long_string = ','.join(list(DisneyTV_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[70]:
In [71]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['disneytv', 'disney', 'https', 'pictwittercom', 'bitly'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(DisneyTV_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [72]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
movie disneyplus waltdisney igshid wwwinstagramcom bluray bluraycollection movielover cinephile b_zwqr2fspa

Topic #1:
great want next started write show get di iflixvipmurah ini

Topic #2:
still shego sixfanarts gargoyles disneytva round cartoon wacom zrqa7wpill possible

Topic #3:
show tonight singalong abc coming womeninanimation throwbackthursday tbt best creating

Topic #4:
avrupa piyasaya netflix ülkelerinde suruldu netflixturkiye sürüldü netflixoneri muratovuc haber

Topic #5:
watching need shares disneyvlog anyone animalkingdom dvc wdw disneyvlogger tv

Topic #6:
season company evil forces star via walt vgmxrfdk chngit change

Topic #7:
family together one video episode iun4 ai imagineers f5zih5pt5o enhanced

Topic #8:
like doraemon kiteretsu get please 3am 2am nostalgic disneyplushs hungama

Topic #9:
stop team elenaofavalor give daytime smiling congratulations 9sllyjjfzh avalor elenamazing

DisneyTV Topics Interpretation:

  • Topic #0: This topic is about Disney movies on Blu-ray and Disney+.
  • Topic #1: This topic seems to be about wanting new shows and writing.
  • Topic #2: This topic revolves around fan art of Disney TV Animation characters (Shego, Gargoyles).
  • Topic #3: This topic is about ABC sing-along specials and Throwback Thursday (#tbt).
  • Topic #9: This topic is about congratulations surrounding Elena of Avalor.

LDA - ItsOnATT

In [73]:
ItsOnATT_LDA = ItsOnATT
ItsOnATT_LDA = pd.DataFrame(ItsOnATT_LDA)
# Remove punctuation
ItsOnATT_LDA[0] = ItsOnATT_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
ItsOnATT_LDA[0] = ItsOnATT_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
ItsOnATT_LDA.head()
Out[73]:
0
0 "there’s a place where the witch of the west m...
1 #itsonatt time to go to #sling att is out again
2 darkus fangzor:\n(secrets of the bakugan)\n#ne...
3 shout out to @jasonasedillo\n#netflix\n#shorts...
4 #repost @katebock pre game welltoday it’s the...
In [74]:
# Join the different processed titles together.
long_string = ','.join(list(ItsOnATT_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[74]:
In [75]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['itsonatt', 'https', 'twittercom'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(ItsOnATT_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [76]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
ladygaga supersaturdaynight edge gfc8z0kgba glory miami bekind littlemonsters status directv

Topic #1:
status directv netflix trueabq newmexicofilm nbcuniversal abqfilmoffice abqtech nmdevs japanese

Topic #2:
status abqecondev govmlg netflix directv abqtech abqfilmoffice trueabq nbcuniversal newmexicofilm

Topic #3:
film richardbranson underarmour virgingalactic abqecondev wild heart abqtech abqfilmoffice newmexicofilm

Topic #4:
status bakugan abqecondev netflix directv thisisabq innovate nmrx railrunner elonmusk

Topic #5:
gaga att http tv night nfl superbowl super lady 2uwtp7z

Topic #6:
directv netflix secrets film festival bakugan via filmfreewaycom submitted abqtech

Topic #7:
supersaturdaynight varietystudio idpdv5eysf puts siriusxm presented 72 mirandas listen lin_manuel

Topic #8:
game pre wwwinstagramcom repost welltoday katebock si_swimsuit igshid gaga b8j1c5hqqen

Topic #9:
hbomax meets status dark new enjoyed north fresh knight video

ItsOnATT Topics Interpretation:

  • Topic #0: This topic is about a Lady Gaga Super Saturday Night event in Miami on DIRECTV.
  • Topics #1 and #2: These two topics don’t have a clear distinction between them; both revolve around the New Mexico film industry (Netflix, NBCUniversal, the Albuquerque film office).
  • Topic #3: This topic is about Richard Branson and Virgin Galactic.
  • Topic #4: This topic is about Bakugan, with a tweet @mentioning Elon Musk.

AT&T is very brand specific and is associated with many other brands, such as NBCUniversal, New Mexico Film, DIRECTV, Netflix, and the NFL. ItsOnATT is clearly the most brand-associated of all the streaming services.

LDA - fuboTV

In [77]:
fuboTV_LDA = fuboTV
fuboTV_LDA = pd.DataFrame(fuboTV_LDA)
# Remove punctuation
fuboTV_LDA[0] = fuboTV_LDA[0].map(lambda x: re.sub('[,\.!?]', '', x))
# Convert the titles to lowercase
fuboTV_LDA[0] = fuboTV_LDA[0].map(lambda x: x.lower())
# Print out the first rows of papers
fuboTV_LDA.head()
Out[77]:
0
0 get #fubotv live #news #movies #global #sports...
1 get #fubotv live #news #movies #global #sports...
2 get #fubotv live #news #movies #global #sports...
3 #fubotv sports network is launching on viacomc...
4 #fubotv sports network is launching on viacomc...
In [78]:
# Join the different processed titles together.
long_string = ','.join(list(fuboTV_LDA[0].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[78]:
In [79]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words=STOPWORDS + ['fubotv', 'https', 'pictwittercom'])

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(fuboTV_LDA[0])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
In [80]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
Topics found via LDA:

Topic #0:
sports free bitly viacomcbs launching consumers network 2zbqq3t plutotv vj5jnes3s5

Topic #1:
gonna day emails all30 families literally stars including hurt star

Topic #2:
fubo streaming last subscribers end 37 315789 paid 2018 year

Topic #3:
tv disneyplus cookies chip chewy playstationvue hbogo gooey baked twitch

Topic #4:
tv gooey chocolate chewy cookies crackle netflix hbogo twitch baked

Topic #5:
reveals 2019 subscriber revenue tv deadlinecom deadline surge total 05

Topic #6:
get global channel covid19 tv_watchfree mo freeyourtv stayhome movies days

Topic #7:
gooey hbonow hbogo disneyplus cookies chocolate chip chewy playstationvue netflix

Topic #8:
fox sinclair 20 affiliates tv live streaming reached coming platform

Topic #9:
know lastmanstanding ima tweet hope scamlikely friends friend single expensive

fuboTV Topics Interpretation:

  • Topic #0: This topic is about fuboTV’s sports network launching on ViacomCBS’s Pluto TV.
  • Topic #2: This topic revolves around paid-subscriber figures for 2018.
  • Topic #5: This topic reveals a Deadline report on a 2019 subscriber and revenue surge.
  • Topics #3, #4, and #7: These topics compare fuboTV with other services (Netflix, HBO Go, Disney+, PlayStation Vue, Twitch), mixed with what appears to be a repeated cookie-themed tweet.
  • Topic #9: This topic is about watching Last Man Standing.
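The head of the fuboTV frame shows the same promotional tweet repeated verbatim (rows 0–2 and rows 3–4). Dropping exact duplicates before vectorizing would stop one repeated promo from dominating a topic; a minimal sketch on stand-in data:

```python
import pandas as pd

# Tiny stand-in for fuboTV_LDA: the first two rows are verbatim duplicates.
df = pd.DataFrame({0: ["get #fubotv live #news #movies",
                       "get #fubotv live #news #movies",
                       "#fubotv sports network is launching"]})
# Keep the first occurrence of each distinct tweet.
deduped = df.drop_duplicates(subset=[0]).reset_index(drop=True)
# deduped has 2 rows
```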

5. Perform K-means clustering of documents, using terms as variables. Extract 3-5 clusters. Using the cluster means, interpret the topics represented by the clusters.

return to top

In [81]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from collections import Counter

Please note that each streaming service has its own elbow chart. The instructions were to extract between 3 and 5 clusters; however, no service showed a clear elbow in that range except AppleTV and fuboTV. Since AppleTV had its elbow at 4 clusters, we implemented a 4-cluster solution for each streaming service.

Please note that the interpretation of these clusters is purely subjective.
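When the elbow is ambiguous, the silhouette score offers a complementary, less subjective check on the number of clusters. The helper below is our own addition, not part of the original pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 8)):
    """Return the k in k_range with the highest mean silhouette score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```

For example, `best_k_by_silhouette(cv_matrix.toarray())` could be run alongside each elbow chart.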

Clustering

YouTubeTV

In [82]:
# We assume the maximum number of clusters to test is 10;
# the resulting elbow chart helps judge the number of clusters.
def CreateElbowChart(input_to_fit):
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
        kmeans.fit(input_to_fit)
        wcss.append(kmeans.inertia_)  # within-cluster sum of squares

    plt.plot(range(1, 11), wcss)
    plt.title('The Elbow Method')
    plt.xlabel('no of clusters')
    plt.ylabel('wcss')
    plt.show()

CreateElbowChart(YouTubeTV_TFIDF)
In [83]:
stop_words = nltk.corpus.stopwords.words('english') + ['youtubetv', 'youtube', 'https', 'pictwittercom', 'pic']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_YouTubeTV)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[83]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [84]:
corpus_YouTubeTV_df = pd.DataFrame({'Document': corpus_YouTubeTV})
corpus_YouTubeTV_df['kmeans_cluster'] = km.labels_
corpus_YouTubeTV_df
Out[84]:
Document kmeans_cluster
0 wednesday hbomax available youtubetv subscribe... 0
1 youtubetv 0
2 youtubetv review channel lineup dvr local chan... 0
3 make youtube channel easy steps tips beginners... 0
4 youtubetv hbomax internet package includes hbo... 0
... ... ...
4995 coffeebreak time midweeker yall missed last ni... 0
4996 new single quato ft remel wanna ball dropping ... 1
4997 woah woah woahhh guy actually tattoo inside li... 0
4998 youtubetv tcm work deal allow authentication t... 0
4999 officially binge watching restaurantstartup yo... 0

5000 rows × 2 columns

In [85]:
YouTubeClusters = corpus_YouTubeTV_df.groupby('kmeans_cluster').head(20)
YouTubeClusters = YouTubeClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = YouTubeClusters[YouTubeClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['youtubetv', 'youtube', 'channel', 'tv', 'hbomax', 'subscribers', 'pic', 'twitter', 'video', 'better', 'streaming', 'woah', 'channels', 'autoedition', 'google']
CLUSTER #2
Key Features: ['youtubetv', 'nextup', 'unitedmasters', 'may', 'beverlyhills', 'httpswww', 'hollywood', 'new', 'godsplan', 'ball', 'onmywayup', 'ft', 'single', 'compcazuyozghlwigshidetxxzzwibm', 'dropping']
CLUSTER #3
Key Features: ['youtubetv', 'marqueenetwork', 'close', 'comcast', 'th', 'negotiations', 'operations', 'business', 'ops', 'get', 'mlb', 'commarqueeclosetocomcastyoutubedeals', 'httpsmajorleagueaholes', 'completed', 'baseball']
CLUSTER #4
Key Features: ['youtubetv', 'comftfvhoow', 'hulu', 'twitter', 'may', 'microsofts', 'nationalcabletelevisioncooperative', 'chartercommunications', 'new', 'sonys', 'made', 'pic', 'playstation', 'samsung', 'coxcommunications']

YouTubeTV Cluster Interpretation:

  • Cluster 1: centered around HBO Max availability and YouTube channels/subscribers
  • Cluster 2: centered around the West Coast (Hollywood, Beverly Hills)
  • Cluster 3: centered around business negotiations involving Comcast and the Marquee Network (MLB baseball)
  • Cluster 4: centered around the National Cable Television Cooperative and carriage partners (Charter, Cox, Sony PlayStation, Samsung)
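The cells above leave `Counter(km.labels_)` commented out; printing cluster sizes is a quick sanity check that one catch-all cluster is not absorbing nearly all tweets (which the sample assignments for cluster 0 suggest). An illustrative sketch on toy data:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Toy stand-in for cv_matrix: two obvious groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.5],
              [10.0, 10.0], [10.0, 11.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
sizes = Counter(km.labels_)  # counts of documents per cluster label
```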

Hulu

In [86]:
CreateElbowChart(Hulu_TFIDF)
In [87]:
stop_words = nltk.corpus.stopwords.words('english') + ['hulu', 'https', 'pictwittercom', 'pic', 'e3', '81', 'aa', '89', '82']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_Hulu)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[87]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [88]:
corpus_Hulu_df = pd.DataFrame({'Document': corpus_Hulu})
corpus_Hulu_df['kmeans_cluster'] = km.labels_
corpus_Hulu_df
Out[88]:
Document kmeans_cluster
0 hulu rt httpstep met 0
1 gooddoctorabc hulu great series 0
2 folks lets never grow checking sonic x hulu hu... 0
3 breakingbad netflix amazon hulu pic twitter co... 0
4 watching thegreat hulu ready binge season two ... 3
... ... ...
4995 livehomelysdvd idol fire ver hulu 0
4996 davidspade hi dave justshootme episodes hulu r... 0
4997 chillin watching goldbergs debating playing sw... 0
4998 last lover kismyft hulu peacefuldays radiko as... 0
4999 many books come find provocative novels short ... 2

5000 rows × 2 columns

In [89]:
HuluClusters = corpus_Hulu_df.groupby('kmeans_cluster').head(20)
HuluClusters = HuluClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = HuluClusters[HuluClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['hulu', 'netflix', 'unext', 'nbak', 'fire', 'great', 'series', 'alexandros', 'amazon', 'watching', 'hulus', 'hulualexandros', 'huluaccount', 'httpsyokaranusekai', 'httpswww']
CLUSTER #2
Key Features: ['movies', 'available', 'marvel', 'thcenturyfox', 'documentary', 'moviesnation', 'mystry', 'sony', 'netflix', 'forever', 'scifi', 'comwvgdbaxf', 'entertainment', 'pic', 'drama']
CLUSTER #3
Key Features: ['story', 'hollywood', 'hbo', 'book', 'books', 'poet', 'short', 'find', 'netflix', 'fiction', 'collections', 'come', 'erotica', 'novels', 'indie']
CLUSTER #4
Key Features: ['two', 'season', 'lee', 'like', 'hulu', 'hit', 'harsh', 'nnnnnnnnnnoooooooooo', 'norris', 'ready', 'chuck', 'bruce', 'boom', 'birthday', 'binge']

Hulu Cluster Interpretation:

  • Cluster 1: centered around [Alexandros], a Japanese rock band
  • Cluster 2: centered around sci-fi and Marvel movies
  • Cluster 3: centered around books and poetry
  • Cluster 4: centered around Bruce Lee and Chuck Norris, who appeared together in the classic martial-arts film The Way of the Dragon

Philo

In [90]:
CreateElbowChart(Philo_TFIDF)
In [91]:
stop_words = nltk.corpus.stopwords.words('english') + ['https', 'philo', 'philosophie', 'learnet', 'labelles', 'est', 'tout', 'du', 'pic', 'de', 'le', 'la', 'et', 'les', 'en', 'vie', 'une', 'ce', 'ne']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_Philo)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[91]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [92]:
corpus_Philo_df = pd.DataFrame({'Document': corpus_Philo})
corpus_Philo_df['kmeans_cluster'] = km.labels_
corpus_Philo_df
Out[92]:
Document kmeans_cluster
0 pourquoi seraitil au fond desirable detre soup... 1
1 observer le monde cest le comprendre philosoph... 1
2 le langage et la communication dans la societe... 1
3 click link checkout acoustic version sisters m... 1
4 maintenant en direct episode de mon laboratoir... 1
... ... ...
4995 tout est dans le regard et les yeux demasquent... 1
4996 tout est dans le regard et les yeux demasquent... 1
4997 conferences de philosophie sur la chaine youtu... 1
4998 le tout mene lessentiel philosophie philo veri... 1
4999 le langage et la communication dans la societe... 1

5000 rows × 2 columns

In [93]:
PhiloClusters = corpus_Philo_df.groupby('kmeans_cluster').head(20)
PhiloClusters = PhiloClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = PhiloClusters[PhiloClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['de', 'les', 'zeus', 'ou', 'citation', 'sabreuvent', 'citationdujour', 'citations', 'comconnaisdabordlesquatreracinesdetouteschoses', 'racines', 'qui', 'quatre', 'comnhcbnduzp', 'pic', 'philo']
CLUSTER #2
Key Features: ['philo', 'philosophie', 'le', 'et', 'la', 'de', 'pic', 'pensee', 'twitter', 'httpstwitter', 'tout', 'nouvelle', 'du', 'artiste', 'societe']
CLUSTER #3
Key Features: ['une', 'vie', 'autres', 'bonne', 'les', 'html', 'frlivresarticlequestcequuneviebonnedejudithbutlerlefeuilletonlitterairedecamillelaurens', 'avec', 'sans', 'sera', 'de', 'si', 'dois', 'serait', 'ce']
CLUSTER #4
Key Features: ['en', 'chose', 'plus', 'descartes', 'la', 'chacun', 'bonsens', 'populisme', 'sont', 'car', 'du', 'si', 'pourvu', 'ceux', 'point']

Philo Clusters:

  • The Philo clusters are difficult to interpret; as with the LDA topics, French-language philosophy tweets dominate.
  • As a next step, the feature set needs to be optimized (stop-word removal, term-frequency pruning) before clear, legible clusters can emerge.

Sling

In [94]:
CreateElbowChart(Sling_TFIDF)
C:\Users\jonat\Anaconda3\envs\MachineLearning\lib\site-packages\ipykernel_launcher.py:8: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  
C:\Users\jonat\Anaconda3\envs\MachineLearning\lib\site-packages\ipykernel_launcher.py:8: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (9). Possibly due to duplicate points in X.
  
C:\Users\jonat\Anaconda3\envs\MachineLearning\lib\site-packages\ipykernel_launcher.py:8: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (10). Possibly due to duplicate points in X.
  
In [95]:
stop_words = nltk.corpus.stopwords.words('english') + ['sling', 'pictwittercom', 'https', 'http', 'pic']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_Sling)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[95]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [96]:
corpus_Sling_df = pd.DataFrame({'Document': corpus_Sling})
corpus_Sling_df['kmeans_cluster'] = km.labels_
corpus_Sling_df
Out[96]:
Document kmeans_cluster
0 sling 0
1 sling 0
2 sling 0
3 high quality ecofriendly baby carrierbaby slin... 2
4 jason hated big sister fractures sling sprains... 3
... ... ...
1995 sling 0
1996 know c k safe babywearing cozitot safety newbo... 1
1997 sign sling get free rcaantenna receive local t... 0
1998 topbeauty arts crafts follows steps fashiontra... 0
1999 sling 0

2000 rows × 2 columns

In [97]:
SlingClusters = corpus_Sling_df.groupby('kmeans_cluster').head(20)
SlingClusters = SlingClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = SlingClusters[SlingClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['sling', 'twitter', 'pic', 'walkabout', 'deal', 'minishoulder', 'lyzoobu', 'local', 'lavender', 'httpsebay', 'get', 'free', 'follows', 'fashiontravelslingbag', 'ebay']
CLUSTER #2
Key Features: ['one', 'sahd', 'lyillxznud', 'comfortable', 'comes', 'comvrqazx', 'newborn', 'newdad', 'newmom', 'little', 'know', 'parenting', 'pic', 'cozitot', 'safe']
CLUSTER #3
Key Features: ['baby', 'sling', 'carrier', 'pic', 'safe', 'hipseat', 'waistbaby', 'ecofriendly', 'cute', 'comiwhqrudr', 'child', 'products', 'quality', 'carrierbaby', 'high']
CLUSTER #4
Key Features: ['accidents', 'jason', 'bones', 'books', 'poorjason', 'compcahphrhhpfjigshidwhjwjzkc', 'misery', 'medical', 'instagram', 'big', 'injury', 'httpswww', 'emergency', 'firstaid', 'hated']

Sling Cluster Interpretation:

  • The Sling clusters show the same issue reported in the word cloud and LDA topic sections: the #sling hashtag is dominated by a baby sling product.
  • A next step would be to query @sling instead of #sling. However, tweets from @sling are pushed by the company’s marketing team and may not capture consumer sentiment.

AppleTV

In [98]:
CreateElbowChart(AppleTV_TFIDF)
In [99]:
stop_words = nltk.corpus.stopwords.words('english') + ['appletv', 'pictwittercom', 'http', 'pic', 'apple', 'de', 'en']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_AppleTV)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[99]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [100]:
corpus_AppleTV_df = pd.DataFrame({'Document': corpus_AppleTV})
corpus_AppleTV_df['kmeans_cluster'] = km.labels_
corpus_AppleTV_df
Out[100]:
Document kmeans_cluster
0 missed episode techstrongtv week dont sweat st... 0
1 regarder le mini serie defendre jacob exclusiv... 0
2 news shotgun httpswp mepseslktc news snydercut... 0
3 bought appletv wanted watch chrisevans defendi... 0
4 apple buying content aggressively challenge ne... 0
... ... ...
1995 whoa love series keeps guessing till end twist... 0
1996 tonights binge watching homebeforedark appletv... 0
1997 appletv favor watch mythicquest great fucking ... 0
1998 ha encantado muy divertida original apple appl... 0
1999 stay home amazonappletv appletv amazon stayhom... 0

2000 rows × 2 columns

In [101]:
AppleTVClusters = corpus_AppleTV_df.groupby('kmeans_cluster').head(20)
AppleTVClusters = AppleTVClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = AppleTVClusters[AppleTVClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['appletv', 'apple', 'pic', 'twitter', 'watch', 'streaming', 'news', 'greyhound', 'netflix', 'next', 'home', 'sur', 'great', 'le', 'homebeforedark']
CLUSTER #2
Key Features: ['hero', 'people', 'joanna', 'think', 'neals', 'even', 'ep', 'gut', 'legend', 'give', 'spoke', 'chrisevans', 'hated', 'judge', 'appletv']
CLUSTER #3
Key Features: ['de', 'con', 'panel', 'grande', 'twitter', 'consiguelo', 'totwkdl', 'gratis', 'lo', 'appletv', 'goza', 'este', 'televisor', 'smarttv', 'solo']
CLUSTER #4
Key Features: ['tv', 'und', 'die', 'prime', 'bundesliga', 'appletv', 'hat', 'entsprechende', 'bereits', 'kann', 'schalke', 'app', 'mitverfolgen', 'live', 'fireangebot']

AppleTV Cluster Interpretation:

  • Cluster 1: centered around streaming Home Before Dark
  • Cluster 2: centered around Chris Evans, likely tied to his series Defending Jacob
  • Cluster 3: centered around Spanish-language tweets, mostly promotions for watching Apple TV on smart TVs ('consiguelo gratis', 'televisor', 'smarttv')
  • Cluster 4: centered around German-language tweets about streaming Bundesliga soccer (e.g., Schalke) live
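Clusters 3 and 4 suggest filtering tweets by language before clustering. A rough, dependency-free heuristic, applied to the raw tweet text before stop words are stripped, is to score each tweet by its share of common English function words; a dedicated language-detection library would be more robust. The marker set and threshold below are illustrative assumptions, not tuned values:

```python
# Rough language filter: score each raw tweet by the share of tokens that
# are common English function words. The marker set and 0.08 threshold are
# illustrative assumptions.
ENGLISH_MARKERS = {
    'the', 'and', 'to', 'of', 'in', 'is', 'it', 'for', 'on', 'with',
    'this', 'that', 'you', 'are', 'was', 'watch', 'new',
}

def looks_english(text, threshold=0.08):
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for tok in tokens if tok in ENGLISH_MARKERS)
    return hits / len(tokens) >= threshold

tweets = [
    'bought appletv and wanted to watch the new series this weekend',
    'consiguelo gratis con este televisor smarttv solo hoy',
]
print([looks_english(t) for t in tweets])  # [True, False]
```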

DisneyTV

In [102]:
CreateElbowChart(DisneyTV_TFIDF)
In [103]:
stop_words = nltk.corpus.stopwords.words('english') + ['disneytv', 'disney', 'https', 'pictwittercom', 'bitly', 'pic']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_DisneyTV)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[103]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [104]:
corpus_DisneyTV_df = pd.DataFrame({'Document': corpus_DisneyTV})
corpus_DisneyTV_df['kmeans_cluster'] = km.labels_
corpus_DisneyTV_df
Out[104]:
Document kmeans_cluster
0 stop press give huge congratulations elenamazi... 1
1 anyone else watching disney tv whilst home lov... 0
2 could longer restrain tears end still together... 0
3 still need get watching gargoyles disney disne... 0
4 walt disney company please give fans star vs f... 3
... ... ...
1995 sibling side life soooooooo blessed happy nati... 1
1996 tbt interview daron nefcy creating show star v... 1
1997 disneytv crew coming together virtualtagtuesda... 1
1998 disney avrupa ulkelerinde piyasaya suruldu htt... 0
1999 discovering awesome gravity falls quarentined ... 0

2000 rows × 2 columns

In [105]:
DisneyTVClusters = corpus_DisneyTV_df.groupby('kmeans_cluster').head(20)
DisneyTVClusters = DisneyTVClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = DisneyTVClusters[DisneyTVClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['disney', 'disneytv', 'httpswww', 'like', 'still', 'watching', 'disneyplus', 'movie', 'disneytva', 'instagram', 'tonight', 'twitter', 'netflix', 'doraemon', 'waltdisney']
CLUSTER #2
Key Features: ['disneytv', 'pic', 'twitter', 'show', 'great', 'httpsbit', 'get', 'write', 'disney', 'want', 'next', 'heres', 'started', 'elenaofavalor', 'team']
CLUSTER #3
Key Features: ['di', 'jualanakunpremium', 'viupremum', 'bulan', 'buruan', 'spotifypremiummurah', 'spotifypremium', 'iflixmurah', 'disneytv', 'promo', 'penuh', 'order', 'nordvpn', 'netflixpremium', 'netflixmurah']
CLUSTER #4
Key Features: ['future', 'disneytv', 'give', 'petition', 'via', 'forces', 'smarcostar', 'fans', 'special', 'change', 'disneyxd', 'movie', 'evil', 'ssvtfoe', 'star']

DisneyTV Cluster Interpretation:

  • Cluster 1: centered around watching Doraemon on Disney Plus
  • Cluster 2: centered around getting a great writing team or having a great writing team
  • Cluster 3: centered around Indonesian-language promos reselling discounted premium accounts for streaming services (Netflix, Spotify, VIU) alongside DisneyTV
  • Cluster 4: centered around some type of petition from fans

ItsOnATT

In [106]:
CreateElbowChart(ItsOnATT_TFIDF)
In [107]:
stop_words = nltk.corpus.stopwords.words('english') + ['itsonatt', 'https', 'twittercom']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_ItsOnATT)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[107]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [108]:
corpus_ItsOnATT_df = pd.DataFrame({'Document': corpus_ItsOnATT})
corpus_ItsOnATT_df['kmeans_cluster'] = km.labels_
corpus_ItsOnATT_df
Out[108]:
Document kmeans_cluster
0 theres place witch west meets king north fresh... 0
1 itsonatt time go sling att 3
2 darkus fangzor secrets bakugan netflix shortst... 3
3 shout jasonasedillo netflix shortstv directv i... 3
4 repost katebock pre game well today gaga pre g... 2
... ... ...
1995 submitted secrets bakugan rushes national film... 1
1996 submitted wild heart guitars nd asia destinati... 1
1997 working wild heart cine llc create dark comedy... 3
1998 looks great wondering locally owned operated m... 1
1999 damn missed abqecondev newmexicofilm abqfilmof... 1

2000 rows × 2 columns

In [109]:
ItsOnATTClusters = corpus_ItsOnATT_df.groupby('kmeans_cluster').head(20)
ItsOnATTClusters = ItsOnATTClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = ItsOnATTClusters[ItsOnATTClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['meets', 'itsonatt', 'video', 'dark', 'promo', 'play', 'place', 'north', 'comhbomaxstatus', 'enjoyed', 'new', 'fresh', 'theres', 'hbomax', 'prince']
CLUSTER #2
Key Features: ['newmexicofilm', 'directv', 'abqfilmoffice', 'abqtech', 'netflix', 'nbcuniversal', 'nmdevs', 'itsonatt', 'trueabq', 'abqecondev', 'film', 'bakugan', 'httpfilmfreeway', 'submitted', 'httpstwitter']
CLUSTER #3
Key Features: ['game', 'pre', 'httpswww', 'siswimsuit', 'gaga', 'today', 'well', 'repost', 'instagram', 'itsonatt', 'katebock', 'compbmqysgkoyqigshidnjbocxvo', 'compbjchqqenigshidgndypilko', 'elonmusk', 'festival']
CLUSTER #4
Key Features: ['itsonatt', 'httpstwitter', 'ladygaga', 'supersaturdaynight', 'directv', 'att', 'netflix', 'pic', 'twitter', 'nmrx', 'innovate', 'railrunner', 'gaga', 'shortstv', 'elonmusk']

ItsOnATT Cluster Interpretation:

  • Cluster 1: centered around an HBO Max promo
  • Cluster 2: centered around streaming providers and production houses
  • Cluster 3: centered around Kate Bock and swimsuits. This may be Sports Illustrated related.
  • Cluster 4: centered around short film innovation

fuboTV

In [110]:
CreateElbowChart(fuboTV_TFIDF)
In [111]:
stop_words = nltk.corpus.stopwords.words('english') + ['fubotv', 'https', 'pictwittercom', 'pic']
#cv = CountVectorizer(ngram_range=(1,2),min_df=10, max_df=0.8, stop_words=stop_words)
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(corpus_fuboTV)
# cv_matrix.shape

NUM_CLUSTERS = 4
km = KMeans(n_clusters=NUM_CLUSTERS, max_iter=10000, n_init=50, random_state=42).fit(cv_matrix)
km
# Counter(km.labels_)
Out[111]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=4, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)
In [112]:
corpus_fuboTV_df = pd.DataFrame({'Document': corpus_fuboTV})
corpus_fuboTV_df['kmeans_cluster'] = km.labels_
corpus_fuboTV_df
Out[112]:
Document kmeans_cluster
0 get fubotv live news movies global sports show... 2
1 get fubotv live news movies global sports show... 2
2 get fubotv live news movies global sports show... 2
3 fubotv sports network launching viacomcbss fre... 3
4 fubotv sports network launching viacomcbss fre... 3
... ... ...
1995 solaropposites watching amazing tv gooey chewy... 0
1996 watching amazing tv gooey chewy freshly baked ... 0
1997 getallthree watching amazing tv gooey chewy fr... 0
1998 fubotv reveals subscriber metrics says revenue... 3
1999 end last year fubo paid subscribers fubotv dea... 3

2000 rows × 2 columns

In [113]:
fuboTVClusters = corpus_fuboTV_df.groupby('kmeans_cluster').head(20)
fuboTVClusters = fuboTVClusters.copy(deep=True)

feature_names = cv.get_feature_names()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(NUM_CLUSTERS):
    key_features = [feature_names[index]
                       for index in ordered_centroids[cluster_num, :topn_features]]
    testing = fuboTVClusters[fuboTVClusters['kmeans_cluster'] == cluster_num].values.tolist()
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)
CLUSTER #1
Key Features: ['crackle', 'chewy', 'hbonow', 'hbogo', 'hulu', 'gooey', 'fubotv', 'freshly', 'netflix', 'newjersey', 'disneyplus', 'pic', 'playstationvue', 'cookies', 'chocolate']
CLUSTER #2
Key Features: ['fox', 'sinclair', 'fubotv', 'today', 'group', 'cordcuttersnews', 'reached', 'coming', 'live', 'local', 'tv', 'broadcast', 'comfubotvisaddingfoxaffiliates', 'bring', 'platform']
CLUSTER #3
Key Features: ['get', 'premium', 'live', 'dontgetcabled', 'stayhome', 'free', 'lytvwatchfree', 'sports', 'mo', 'movies', 'showtime', 'news', 'covid', 'nocreditcheck', 'nodeposit']
CLUSTER #4
Key Features: ['fubotv', 'sports', 'launching', 'gonna', 'network', 'httpsdeadline', 'freetoconsumers', 'know', 'plutotv', 'lastmanstanding', 'pic', 'httpsbit', 'deadline', 'day', 'lyzbqqt']

fuboTV Cluster Interpretation:

  • Cluster 1: centered around competitors or partners such as HBO Go, Netflix, Disney+, and Hulu
  • Cluster 2: centered around cord cutters moving from big broadcast platforms to streaming tv
  • Cluster 3: centered around staying home for covid
  • Cluster 4: centered around the launch of Pluto TV

6. Can you generate a Brand Map from your analysis? Can you generate a list of Delighters/Disappointers for leading Brands in the category? If not – state the limitations of the analysis.

return to top

Limitations & delighters and disappointers

  • Yes; we can generate this list of delighters vs. disappointers by analyzing the topics produced by LDA and comparing opinions across brands. After all, sentiment analysis is effectively opinion mining.
  • While the tweets we analyze are potentially written by customers, the data may be biased toward recent events and may not capture the full customer hierarchy of needs. We can map brands across two dimensions at a time, but those dimensions are not necessarily what customers value most.
In [114]:
import numpy as np
import pandas as pd
import re

import time
import math
from textblob import TextBlob

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

import string

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

import gensim
from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis.gensim
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jonat\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
In [115]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
In [116]:
def sentiment_parser(x):
    if x['compound'] <= -0.05:
        return 'negative'
    elif x['compound'] >= 0.05:
        return 'positive'
    else:
        return 'neutral'
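`sentiment_parser` buckets VADER's compound score using the conventional ±0.05 cutoffs. To make the thresholds concrete, here it is exercised on hand-built score dicts standing in for `analyser.polarity_scores()` output:

```python
# The parser reads only the 'compound' key; polarity_scores() also returns
# 'pos', 'neu', and 'neg' fields, which are ignored here.
def sentiment_parser(x):
    if x['compound'] <= -0.05:
        return 'negative'
    elif x['compound'] >= 0.05:
        return 'positive'
    else:
        return 'neutral'

examples = [{'compound': -0.60}, {'compound': 0.01}, {'compound': 0.42}]
print([sentiment_parser(s) for s in examples])  # ['negative', 'neutral', 'positive']
```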
In [117]:
YouTubeTVdf_t = pd.DataFrame({'Document': YouTubeTV})
Huludf_t = pd.DataFrame({'Document': Hulu})
Philodf_t = pd.DataFrame({'Document': Philo})
Slingdf_t = pd.DataFrame({'Document': Sling})
AppleTVdf_t = pd.DataFrame({'Document': AppleTV})
DisneyTVdf_t = pd.DataFrame({'Document': DisneyTV})
ItsOnATTdf_t = pd.DataFrame({'Document': ItsOnATT})
fuboTVdf_t = pd.DataFrame({'Document': fuboTV})
In [118]:
# apply the same cleaning + VADER sentiment pipeline to every brand
for brand_df in [YouTubeTVdf_t, Huludf_t, Philodf_t, Slingdf_t,
                 AppleTVdf_t, DisneyTVdf_t, ItsOnATTdf_t, fuboTVdf_t]:
    brand_df['text_clean'] = brand_df['Document'].map(lambda x: re.sub('[^a-zA-Z0-9  . , : - _]', '', str(x)))
    brand_df['text_clean'] = brand_df['text_clean'].str.lower()
    brand_df['vader_sentiment_test'] = brand_df['text_clean'].apply(lambda x: analyser.polarity_scores(x))
    brand_df['streamer_sentiment'] = brand_df['vader_sentiment_test'].apply(lambda x: sentiment_parser(x))
In [119]:
YouTubeTVdf_t = YouTubeTVdf_t[['Document', 'text_clean', 'streamer_sentiment']]
Huludf_t = Huludf_t[['Document', 'text_clean', 'streamer_sentiment']]
Philodf_t = Philodf_t[['Document', 'text_clean', 'streamer_sentiment']]
Slingdf_t = Slingdf_t[['Document', 'text_clean', 'streamer_sentiment']]
AppleTVdf_t = AppleTVdf_t[['Document', 'text_clean', 'streamer_sentiment']]
DisneyTVdf_t = DisneyTVdf_t[['Document', 'text_clean', 'streamer_sentiment']]
ItsOnATTdf_t = ItsOnATTdf_t[['Document', 'text_clean', 'streamer_sentiment']]
fuboTVdf_t = fuboTVdf_t[['Document', 'text_clean', 'streamer_sentiment']]
In [120]:
YouTubeTVdf_t['streamer'] = 'YouTubeTV'
Huludf_t['streamer'] = 'Hulu'
Philodf_t['streamer'] = 'Philo'
Slingdf_t['streamer'] = 'Sling'
AppleTVdf_t['streamer'] = 'AppleTV'
DisneyTVdf_t['streamer'] = 'DisneyTV'
ItsOnATTdf_t['streamer'] = 'ItsOnATT'
fuboTVdf_t['streamer'] = 'fuboTV'
In [121]:
frames = [YouTubeTVdf_t, Huludf_t, Philodf_t, Slingdf_t, AppleTVdf_t, DisneyTVdf_t, ItsOnATTdf_t, fuboTVdf_t]
In [122]:
df = pd.concat(frames)
In [123]:
doc_complete = list(df.text_clean)
In [124]:
stop_streamer_names = ['youtubetv','hulu','philo','sling', 'appletv','disneytv','itsonatt','fubotv']
In [125]:
stop = set(stopwords.words('english')).union(stop_streamer_names)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
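The `clean` helper chains three steps: stop-word removal, punctuation stripping, then lemmatization. Because punctuation is removed after stop words, a token like "great," survives the stop pass until its comma is stripped. A stdlib-only sketch of the first two steps, with a toy stop list and no lemmatizer:

```python
import string

# First two stages of clean(): drop stop words, then strip punctuation.
# The notebook's version additionally lemmatizes with WordNetLemmatizer.
toy_stop = {'the', 'a', 'is', 'on'}
exclude = set(string.punctuation)

def clean_sketch(doc):
    stop_free = ' '.join(w for w in doc.lower().split() if w not in toy_stop)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    return punc_free

print(clean_sketch('The DVR is great, but the app crashes on startup!'))
# dvr great but app crashes startup
```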
In [126]:
doc_clean = [clean(doc).split() for doc in doc_complete]
In [127]:
# punctuation is stripped after stop-word removal in clean(), so tokens like
# "sling." survive the first pass; filter the brand names once more here
final_list = [[token for token in doc if token not in stop_streamer_names]
              for doc in doc_clean]
In [128]:
doc_clean = final_list
In [129]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
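`doc2bow` converts each token list into sparse (token_id, count) pairs using the dictionary's token-to-id mapping. A conceptual stand-in using `collections.Counter` (gensim's id assignment may differ in order, but the representation is the same):

```python
from collections import Counter

# Conceptual stand-in for corpora.Dictionary + doc2bow: assign each unique
# token an integer id in order of first appearance, then represent each
# document as sorted (token_id, count) pairs.
docs = [['great', 'dvr', 'great'], ['dvr', 'crashes']]

token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    return sorted((token2id[tok], n) for tok, n in Counter(doc).items())

print([to_bow(d) for d in docs])  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```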
In [130]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
In [131]:
%%time

ldamodel = LdaMulticore(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50) #3 topics
print(*ldamodel.print_topics(num_topics=3, num_words=10), sep='\n')
Wall time: 0 ns
(0, '0.025*"tv" + 0.024*"netflix" + 0.022*"disneyplus" + 0.021*"amazing" + 0.021*"chewy" + 0.021*"twitch" + 0.021*"playstationvue" + 0.021*"hbogo" + 0.021*"freshly" + 0.021*"cooky"')
(1, '0.020*"get" + 0.017*"live" + 0.015*"channel" + 0.013*"via" + 0.012*"day" + 0.010*"free" + 0.010*"staysafe" + 0.010*"covid19" + 0.010*"7" + 0.010*"streaming"')
(2, '0.017*"pre" + 0.017*"game" + 0.017*"supersaturdaynight" + 0.016*"de" + 0.013*"le" + 0.012*"ladygaga" + 0.010*"netflix" + 0.008*"philosophie" + 0.008*"gaga" + 0.008*"katebock"')
In [132]:
%%time

ldamodel = LdaMulticore(doc_term_matrix, num_topics=8, id2word = dictionary, passes=50) #8 topics
print(*ldamodel.print_topics(num_topics=8, num_words=15), sep='\n')
Wall time: 0 ns
(0, '0.031*"game" + 0.031*"pre" + 0.019*"le" + 0.019*"la" + 0.016*"youtube" + 0.015*"gaga" + 0.015*"repost" + 0.015*"siswimsuit" + 0.015*"katebock" + 0.015*"welltoday" + 0.013*"et" + 0.013*"communication" + 0.013*"dans" + 0.013*"nouvelle" + 0.011*"time"')
(1, '0.042*"tv" + 0.035*"netflix" + 0.034*"disneyplus" + 0.034*"amazing" + 0.034*"crackle" + 0.034*"freshly" + 0.034*"gooey" + 0.034*"hbogo" + 0.034*"hbonow" + 0.034*"chocolate" + 0.034*"chip" + 0.034*"baked" + 0.034*"chewy" + 0.034*"cooky" + 0.034*"twitch"')
(2, '0.051*"get" + 0.027*"staysafe" + 0.027*"covid19" + 0.026*"movie" + 0.025*"news" + 0.025*"free" + 0.025*"channel" + 0.024*"sport" + 0.024*"7" + 0.024*"live" + 0.024*"day" + 0.024*"stayhome" + 0.024*"freeyourtv" + 0.024*"global" + 0.024*"nocreditcheck"')
(3, '0.030*"de" + 0.020*"le" + 0.015*"philosophie" + 0.015*"citation" + 0.015*"unext" + 0.012*"bakugannetflixshortstvdirectvitsonattinnovaterailrunnernmrxabqecondevelonmuskthisisabq" + 0.012*"darkus" + 0.012*"httpstwittercombakuganstatus1232393921621319681" + 0.012*"fangzorsecrets" + 0.010*"sur" + 0.010*"youtube" + 0.010*"la" + 0.010*"netflix" + 0.007*"cant" + 0.007*"like"')
(4, '0.025*"netflix" + 0.022*"fox" + 0.022*"sinclair" + 0.018*"streaming" + 0.018*"tv" + 0.017*"directv" + 0.016*"local" + 0.014*"vie" + 0.014*"une" + 0.012*"live" + 0.012*"make" + 0.012*"great" + 0.011*"channel" + 0.011*"coming" + 0.011*"broadcast"')
(5, '0.029*"film" + 0.028*"bakugan" + 0.028*"secret" + 0.022*"via" + 0.019*"newmexicofilmabqfilmofficeitsonattnmdevsabqtechtrueabqdirectvnbcuniversalnetflix" + 0.019*"festival" + 0.017*"submitted" + 0.016*"disney" + 0.016*"artiste" + 0.015*"trailer" + 0.013*"horror" + 0.012*"subscriber" + 0.012*"last" + 0.012*"end" + 0.012*"year"')
(6, '0.021*"friend" + 0.017*"dark" + 0.017*"meet" + 0.015*"best" + 0.014*"le" + 0.013*"video" + 0.012*"new" + 0.012*"yall" + 0.010*"know" + 0.010*"single" + 0.010*"every" + 0.010*"oppsomethingwentwrongserivcedontdownload" + 0.010*"tweet" + 0.010*"expensive" + 0.010*"scamlikely"')
(7, '0.027*"star" + 0.024*"1" + 0.024*"lastmanstanding" + 0.024*"gonna" + 0.013*"watch" + 0.013*"day" + 0.013*"watching" + 0.012*"even" + 0.012*"people" + 0.012*"thats" + 0.012*"good" + 0.012*"family" + 0.012*"know" + 0.012*"hurt" + 0.012*"httpstwittercomappleosophystatus1260294950685741063"')

Topic 1 - Live Events

In [133]:
live_events = df[df.text_clean.str.contains('live|news|channel|next|game|missed|team')]
In [134]:
live_events = live_events[~ live_events.text_clean.str.contains('episode|movie|original|show|film|artiste|channel')]
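The two `str.contains` calls form an include/exclude keyword filter. Note that 'channel' appears in both the include list (In [133]) and the exclude list (In [134]), so any tweet mentioning 'channel' is ultimately dropped from the live-events subset. A stdlib sketch of the same filter on made-up documents:

```python
import re

# Include/exclude keyword filter used to carve out the live-events subset.
# Documents here are invented for illustration.
include = re.compile('live|news|channel|next|game|missed|team')
exclude = re.compile('episode|movie|original|show|film|artiste|channel')

docs = [
    'missed the game last night on youtubetv',
    'loved the original series finale on hulu',
    'great channel lineup for live news',
]
live_events = [d for d in docs if include.search(d) and not exclude.search(d)]
print(live_events)  # ['missed the game last night on youtubetv']
```

The third document matches both lists, so it is dropped even though it is about live TV; removing 'channel' from one of the lists would change that.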
In [135]:
percent_calc = lambda y:y.sum()/y.count()
In [136]:
live_event_counts = live_events.groupby(['streamer','streamer_sentiment']).agg({'Document':['count']})
In [137]:
live_event_counts
Out[137]:
Document
count
streamer streamer_sentiment
AppleTV negative 200
neutral 200
positive 100
DisneyTV positive 100
Hulu neutral 250
positive 250
ItsOnATT neutral 200
Philo positive 250
YouTubeTV negative 250
neutral 250
positive 250
fuboTV positive 200
In [138]:
live_event_counts.groupby(level=0).apply(lambda x:x / float(x.sum()))
Out[138]:
Document
count
streamer streamer_sentiment
AppleTV negative 0.400000
neutral 0.400000
positive 0.200000
DisneyTV positive 1.000000
Hulu neutral 0.500000
positive 0.500000
ItsOnATT neutral 1.000000
Philo positive 1.000000
YouTubeTV negative 0.333333
neutral 0.333333
positive 0.333333
fuboTV positive 1.000000
In [139]:
cust_percent = live_event_counts.groupby(level=0).apply(lambda x:x / float(x.sum()))
In [140]:
names = cust_percent.index.get_level_values(0)
values = np.square(cust_percent.values)
In [141]:
names
Out[141]:
Index(['AppleTV', 'AppleTV', 'AppleTV', 'DisneyTV', 'Hulu', 'Hulu', 'ItsOnATT',
       'Philo', 'YouTubeTV', 'YouTubeTV', 'YouTubeTV', 'fuboTV'],
      dtype='object', name='streamer')
In [142]:
print('Final Scores:')
print(names[0])
print(1-np.sum(values[0:3]))
print(names[3])
print(1-np.sum(values[3:4]))
print(names[4])
print(1-np.sum(values[4:6]))
print(names[6])
print(1-np.sum(values[6:7]))
print(names[7])
print(1-np.sum(values[7:8]))
print(names[8])
print(1-np.sum(values[8:11]))
print(names[11])
print(1-np.sum(values[11:]))
Final Scores:
AppleTV
0.6399999999999999
DisneyTV
0.0
Hulu
0.5
ItsOnATT
0.0
Philo
0.0
YouTubeTV
0.6666666666666667
fuboTV
0.0
In [143]:
# No sling
live_event_scores = [1-np.sum(values[0:3]), 1-np.sum(values[3:4]), 1-np.sum(values[4:6]),
                     1-np.sum(values[6:7]), 1-np.sum(values[7:8]), np.nan,
                     1-np.sum(values[8:11]),1-np.sum(values[11:])]
In [144]:
live_event_scores
Out[144]:
[0.6399999999999999, 0.0, 0.5, 0.0, 0.0, nan, 0.6666666666666667, 0.0]
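The "final scores" above are one minus the sum of squared sentiment shares per brand, i.e. a Gini-Simpson diversity index over the negative/neutral/positive counts: 0.0 when a single sentiment label dominates completely, up to 2/3 for an even three-way split. A small helper makes the computation explicit and avoids the hard-coded index slices:

```python
# Gini-Simpson diversity of a brand's sentiment counts:
# score = 1 - sum(share_i ** 2). 0.0 means one sentiment label dominates;
# the maximum for three labels is 2/3 (even split).
def sentiment_diversity(counts):
    total = sum(counts)
    if total == 0:
        return float('nan')
    return 1 - sum((c / total) ** 2 for c in counts)

print(sentiment_diversity([200, 200, 100]))  # AppleTV live events -> 0.6399999...
print(sentiment_diversity([0, 0, 200]))      # fuboTV live events  -> 0.0
```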

Topic 2 - Original Programming

In [145]:
original_programming = df[df.text_clean.str.contains('episode|movie|original|show|film|artiste|channel')]
In [146]:
original_programming = original_programming[~ original_programming.text_clean.str.contains('live|news|channel|next|game|missed|team')]
In [147]:
original_programming_counts = original_programming.groupby(['streamer','streamer_sentiment']).agg({'Document':['count']})
In [148]:
original_programming_counts
Out[148]:
Document
count
streamer streamer_sentiment
AppleTV negative 100
neutral 100
positive 300
DisneyTV negative 100
neutral 100
positive 500
Hulu positive 750
ItsOnATT negative 100
neutral 100
positive 300
Philo neutral 500
In [149]:
cancel_percent = original_programming_counts.groupby(level=0).apply(lambda x:x / float(x.sum()))
In [150]:
cancel_percent
Out[150]:
Document
count
streamer streamer_sentiment
AppleTV negative 0.200000
neutral 0.200000
positive 0.600000
DisneyTV negative 0.142857
neutral 0.142857
positive 0.714286
Hulu positive 1.000000
ItsOnATT negative 0.200000
neutral 0.200000
positive 0.600000
Philo neutral 1.000000
In [151]:
names = cancel_percent.index.get_level_values(0)
values = np.square(cancel_percent.values)
In [152]:
names
Out[152]:
Index(['AppleTV', 'AppleTV', 'AppleTV', 'DisneyTV', 'DisneyTV', 'DisneyTV',
       'Hulu', 'ItsOnATT', 'ItsOnATT', 'ItsOnATT', 'Philo'],
      dtype='object', name='streamer')
In [153]:
values
Out[153]:
array([[0.04      ],
       [0.04      ],
       [0.36      ],
       [0.02040816],
       [0.02040816],
       [0.51020408],
       [1.        ],
       [0.04      ],
       [0.04      ],
       [0.36      ],
       [1.        ]])
In [154]:
print('Final Scores:')
print(names[0])
print(1-np.sum(values[0:3]))
print(names[3])
print(1-np.sum(values[3:6]))
print(names[6])
print(1-np.sum(values[6:7]))
print(names[7])
print(1-np.sum(values[7:10]))
print(names[10])
print(1-np.sum(values[10:]))
Final Scores:
AppleTV
0.56
DisneyTV
0.44897959183673464
Hulu
0.0
ItsOnATT
0.56
Philo
0.0
In [155]:
# no Sling, YouTubeTV, or fuboTV
original_programming_scores = [1-np.sum(values[0:3]), 1-np.sum(values[3:6]), 1-np.sum(values[6:7]), 
                               1-np.sum(values[7:10]), 1-np.sum(values[10:]), np.nan, np.nan, np.nan]
In [156]:
original_programming_scores
Out[156]:
[0.56, 0.44897959183673464, 0.0, 0.56, 0.0, nan, nan, nan]

Overall Service

In [157]:
overall_service = df[df.text_clean.str.contains('book|great|like|sister|communication|baby|citation|streaming|still|pense|descartes|chose|plus|tv|view|watch|lineup|dvr|cordcutter|cordcutters|review|subscriber|available')]
In [158]:
overall_service = overall_service[~ overall_service.text_clean.str.contains('live|news|channel|next|game|missed|team|episode|movie|original|show|film|artiste|channel')]
In [159]:
overall_service_counts = overall_service.groupby(['streamer','streamer_sentiment']).agg({'Document':['count']})
In [160]:
overall_service_counts
Out[160]:
Document
count
streamer streamer_sentiment
AppleTV neutral 400
positive 300
DisneyTV neutral 400
positive 400
Hulu negative 500
neutral 250
positive 750
ItsOnATT neutral 400
positive 200
Philo neutral 2250
positive 250
Sling negative 100
positive 300
YouTubeTV neutral 1250
positive 1750
fuboTV negative 100
neutral 400
positive 1000
In [161]:
overall_percent = overall_service_counts.groupby(level=0).apply(lambda x:x / float(x.sum()))
In [162]:
overall_percent
Out[162]:
Document
count
streamer streamer_sentiment
AppleTV neutral 0.571429
positive 0.428571
DisneyTV neutral 0.500000
positive 0.500000
Hulu negative 0.333333
neutral 0.166667
positive 0.500000
ItsOnATT neutral 0.666667
positive 0.333333
Philo neutral 0.900000
positive 0.100000
Sling negative 0.250000
positive 0.750000
YouTubeTV neutral 0.416667
positive 0.583333
fuboTV negative 0.066667
neutral 0.266667
positive 0.666667
In [163]:
names = overall_percent.index.get_level_values(0)
values = np.square(overall_percent.values)
In [164]:
names
Out[164]:
Index(['AppleTV', 'AppleTV', 'DisneyTV', 'DisneyTV', 'Hulu', 'Hulu', 'Hulu',
       'ItsOnATT', 'ItsOnATT', 'Philo', 'Philo', 'Sling', 'Sling', 'YouTubeTV',
       'YouTubeTV', 'fuboTV', 'fuboTV', 'fuboTV'],
      dtype='object', name='streamer')
In [165]:
print('Final Scores:')
print(names[0])
print(1-np.sum(values[0:2]))
print(names[2])
print(1-np.sum(values[2:4]))
print(names[4])
print(1-np.sum(values[4:7]))
print(names[7])
print(1-np.sum(values[7:9]))
print(names[9])
print(1-np.sum(values[9:11]))
print(names[11])
print(1-np.sum(values[11:13]))
print(names[13])
print(1-np.sum(values[13:15]))
print(names[15])
print(1-np.sum(values[15:]))
Final Scores:
AppleTV
0.48979591836734704
DisneyTV
0.5
Hulu
0.6111111111111112
ItsOnATT
0.4444444444444444
Philo
0.17999999999999994
Sling
0.375
YouTubeTV
0.48611111111111105
fuboTV
0.48
In [166]:
overall_service_scores = [1-np.sum(values[0:2]), 1-np.sum(values[2:4]), 1-np.sum(values[4:7]),
                          1-np.sum(values[7:9]), 1-np.sum(values[9:11]), 1-np.sum(values[11:13]),
                          1-np.sum(values[13:15]), 1-np.sum(values[15:])]

Final Perceptual Map Setup

In [167]:
streamers = ['AppleTV', 'DisneyTV', 'Hulu', 'ItsOnATT', 'Philo', 'Sling', 'YouTubeTV', 'fuboTV']
In [168]:
live_event_scores = pd.DataFrame(live_event_scores)
original_programming_scores = pd.DataFrame(original_programming_scores)
overall_service_scores = pd.DataFrame(overall_service_scores)
In [169]:
final_scores = pd.concat([live_event_scores,original_programming_scores,overall_service_scores],axis=1)
final_scores
Out[169]:
0 0 0
0 0.640000 0.56000 0.489796
1 0.000000 0.44898 0.500000
2 0.500000 0.00000 0.611111
3 0.000000 0.56000 0.444444
4 0.000000 0.00000 0.180000
5 NaN NaN 0.375000
6 0.666667 NaN 0.486111
7 0.000000 NaN 0.480000
In [170]:
final_scores.columns = ['Live Events', 'Original Programming', 'Overall Streamers']
In [171]:
final_scores.index = streamers
In [172]:
final_scores = round(final_scores*10,2)
In [173]:
final_scores
Out[173]:
Live Events Original Programming Overall Streamers
AppleTV 6.40 5.60 4.90
DisneyTV 0.00 4.49 5.00
Hulu 5.00 0.00 6.11
ItsOnATT 0.00 5.60 4.44
Philo 0.00 0.00 1.80
Sling NaN NaN 3.75
YouTubeTV 6.67 NaN 4.86
fuboTV 0.00 NaN 4.80

Perceptual Map

In [174]:
from sklearn.manifold import MDS
In [175]:
# due to NA values in some of the dimensions, those streaming services had to be dropped
final_scores = final_scores.dropna()
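If keeping all eight services on the map mattered, an alternative to `dropna` would be column-mean imputation. This sketch rebuilds the scores table from the output above (values hard-coded for illustration; not what the notebook does):

```python
import pandas as pd
import numpy as np

# Scores from Out[173], hard-coded for a self-contained example.
final_scores = pd.DataFrame(
    {"Live Events":          [6.40, 0.00, 5.00, 0.00, 0.00, np.nan, 6.67, 0.00],
     "Original Programming": [5.60, 4.49, 0.00, 5.60, 0.00, np.nan, np.nan, np.nan],
     "Overall Streamers":    [4.90, 5.00, 6.11, 4.44, 1.80, 3.75, 4.86, 4.80]},
    index=['AppleTV', 'DisneyTV', 'Hulu', 'ItsOnATT',
           'Philo', 'Sling', 'YouTubeTV', 'fuboTV'])

# Fill each missing score with its column mean instead of dropping the row.
imputed = final_scores.fillna(final_scores.mean())
print(imputed.isna().sum().sum())  # 0: all eight services retained
```

The trade-off is that imputed brands cluster near the column averages, which can make them look artificially similar on the map.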
In [176]:
embedding = MDS(n_components=2,random_state=2019)
scores_transformed = embedding.fit_transform(final_scores)
scores_transformed.shape
Out[176]:
(5, 2)
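With default settings, `MDS` treats each brand's three scores as a point and embeds their pairwise Euclidean distances into 2-D. A self-contained sketch with stand-in scores for the five retained services:

```python
import numpy as np
from sklearn.manifold import MDS

# Stand-in score rows (Live Events, Original Programming, Overall Streamers)
# for the five services that survive the dropna step.
scores = np.array([[6.4, 5.6, 4.9],
                   [0.0, 4.5, 5.0],
                   [5.0, 0.0, 6.1],
                   [0.0, 5.6, 4.4],
                   [0.0, 0.0, 1.8]])

# Metric MDS on Euclidean distances; the seed fixes the otherwise
# rotation/reflection-arbitrary layout.
embedding = MDS(n_components=2, random_state=2019)
coords = embedding.fit_transform(scores)
print(coords.shape)  # one 2-D point per brand
```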
In [179]:
import matplotlib.pyplot as plt
from matplotlib.offsetbox import (TextArea, DrawingArea, OffsetImage,
                                  AnnotationBbox)
from matplotlib.cbook import get_sample_data

fig, ax = plt.subplots(figsize=(20, 10))
ax.scatter(scores_transformed[:,0],scores_transformed[:,1])

# per-service offsets (in points) for the logo annotation boxes
logo_offsets = {'Hulu': (-100., -20.),
                'DisneyTV': (100., -20.),
                'Philo': (-100., 50.),
                'AppleTV': (-100., -20.),
                'ItsOnATT': (100., 70.)}

for i, txt in enumerate(list(final_scores.index)):
    xy = [scores_transformed[i,0],scores_transformed[i,1]]
    if txt in logo_offsets:
        fn = get_sample_data(txt + ".png", asfileobj=False)
        arr_img = plt.imread(fn, format='png')
        imagebox = OffsetImage(arr_img, zoom=1.0)
        imagebox.image.axes = ax
        ab = AnnotationBbox(imagebox, xy,
                            xybox=logo_offsets[txt],
                            xycoords='data',
                            boxcoords="offset points",
                            pad=0.5,
                            )
        ax.add_artist(ab)
plt.xlabel("X axis, Original Programming")
plt.ylabel("Y axis, Live Events")        
plt.show()
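If the logo PNG files are unavailable, plain-text annotations give the same labeling. A minimal sketch with made-up coordinates standing in for `scores_transformed`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical 2-D MDS coordinates for the five retained services.
labels = ['AppleTV', 'DisneyTV', 'Hulu', 'ItsOnATT', 'Philo']
coords = np.array([[1.2, 0.8], [-0.5, 1.1], [0.9, -1.4],
                   [-1.1, -0.2], [-0.5, -0.3]])

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, labels):
    # nudge each label 5 points up-right so it does not cover its marker
    ax.annotate(name, (x, y), xytext=(5, 5), textcoords="offset points")
ax.set_xlabel("X axis, Original Programming")
ax.set_ylabel("Y axis, Live Events")
print(len(ax.texts))  # one text label per brand
```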

7. What are the sampling limitations of this approach?

return to top

Sampling limitations

  • History of Tweets pulled – pulling too few Tweets paints an incomplete picture of a brand, while pulling too many takes increasingly more compute
  • We used VADER (Valence Aware Dictionary for sEntiment Reasoning), which scores statements word by word: it adds up the weights of the word stems it recognizes on a positive-to-negative scale. This can skew the sentiment analysis and miss broader context such as sarcasm or negation
    • Human judgement is partially sacrificed when using a tool like this
  • Neutral statements should effectively be statements of fact, but VADER is not a fact checker, and such a checker would need fairly regular updates to incorporate new facts and events
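The word-by-word limitation can be seen with a toy lexicon scorer (a simplified stand-in for VADER, not its actual lexicon or weighting):

```python
# Toy sentiment lexicon: word -> weight on a positive-to-negative scale.
LEXICON = {"love": 2.0, "great": 1.5, "down": -1.0, "terrible": -2.0}

def toy_score(text):
    """Sum the weights of recognized words, ignoring all context."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

print(toy_score("I love Hulu, the originals are great"))  # 3.5: both cues counted
print(toy_score("Oh great, Sling is down again"))         # 0.5: sarcasm reads as net positive
```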

8. Should this approach be relied upon to produce Product Positioning maps (Brand Maps)? Will the resulting Brand Map represent the "true" Brand Health?

return to top

Product positioning maps & brand health

  • Product positioning maps can surely be generated by our analysis in terms of understanding an individual brand’s strengths as well as the strengths (and drawbacks) of competitors.
  • Shortcomings of competitors can be viewed as opportunities to exploit for a brand of interest
  • That said, true “brand health” cannot be captured through tweets alone. Standard “power rankings” of the kind BCG famously employs remain relevant (sales, units, visits, ACV, average spend per household, etc.), so behavioural data combined with perceptual maps and traditional sales metrics provides a fuller picture of brand health. Market share is also key to understanding a brand’s capture of the addressable market.

9. Should this approach be relied upon to produce Brand Delighters and Disappointers for Brands?

return to top

Approach relied upon to produce delighters and disappointers?

  • It can surely be used as one of the tools to understand delighters and disappointers but should not be the only mechanism leveraged.
    • Not everyone voices concerns on Twitter
    • Not everyone has a Twitter account
    • On review sites, people are more inclined to voice complaints than praise
    • A product’s role in a consumer’s life may change/evolve so the correct / most important dimensions need to be updated and analyzed
    • It cannot replace traditional controlled studies
      • For example, streaming service perception can be partially limited by one’s internet connection

10. Will the results be reliable?

return to top

Result reliability

  • Results should be reliable if using the same sentiment analyzer / analysis method, whether it be a rule-based or feature-based or embedding-based approach
    • TextBlob – rule-based API
    • VADER
    • SVMs
    • Logistic regression
  • Reliability may come into question when considering recent events, including outages, as well as the history of data used; continuing studies need to be structured appropriately and analyzed in a consistent fashion
  • Results can vary across segments of the population – just because two companies both offer streaming TV service doesn’t mean they are direct competitors
    • The overlay of customer panels and household specific spending patterns should be considered
    • Furthermore, the distinction between the customer vs. consumer ought to be made – the one purchasing the service might not be the one consuming the content – we need to capture both perspectives
    • If major shows change platforms, that needs to be considered as it could negatively/positively skew perception
    • Unique respondents need to be considered – are a handful of the same people consistently producing the Tweets?
      • That also begs the question of whether the Tweets are authentic – Twitter “farms” exist to push out content en masse and can be used to sway opinion
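Even without user IDs, exact-duplicate text offers a rough check on the repeat-poster concern above, since copy-paste amplification is a common bot-farm pattern. A sketch with made-up tweets:

```python
import pandas as pd

# Hypothetical pulled tweets; repeated identical text is a rough proxy
# for copy-paste amplification or coordinated "farm" activity.
tweets = pd.Series([
    "Cut the cord with #Sling today!",
    "Cut the cord with #Sling today!",
    "Loving the #YouTubeTV interface",
    "Cut the cord with #Sling today!",
])

# Counts per unique text, sorted descending; entries > 1 are suspect.
dup_counts = tweets.value_counts()
print(dup_counts[dup_counts > 1])
```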